Rewriting Vidya in JavaScript

We are rewriting Vidya in plain JavaScript on Deno.

This is not a decision we made lightly. The OCaml codebase works. The autograd engine is correct. The training loop runs. The model learns. We have spent six weeks building a from-scratch transformer framework in OCaml and it does everything we asked of it.

We are rewriting it anyway, because the next phase of this project demands something the current codebase cannot give us: fluency.


Why Now

The v2 architecture is designed. Five changes to the model: SwiGLU activation, Grouped Query Attention, 30 layers, a 32,000-token vocabulary, and a 2,048-token context window. Every one of these touches the forward pass, the backward pass, and the checkpoint format. The v1 model definition is effectively replaced.

If we are rewriting the model anyway, we should rewrite it in a language we can move fast in. The framework code (autograd, optimizer, tokenizer) carries forward in concept but not in syntax. The total rewrite is about 2,000 lines.

This is the last moment before the irreversible line. Once we start teaching the model with reinforcement learning, the weights become irreplaceable. Before that line, the code is the thing we can afford to change. After it, the weights are the thing we cannot.


Why JavaScript

We built Vidya in OCaml to understand every layer. No PyTorch, no Python, no frameworks. That goal is accomplished: we understand transformers, autograd, tokenization, and training loops from first principles. The question now is not "can we build it?" but "can we maintain it, extend it, and debug it for years?"

The answer is: more easily in JavaScript.

We think in JavaScript. When we see a bug in training, we need to trace through the forward pass, identify the wrong gradient, and fix it. That debugging happens faster in a language where the syntax is invisible, where we read logic, not notation. OCaml is readable, but JavaScript is native. The difference is small on any given line and large over a year of maintenance.

Claude codes better in JavaScript. This project is a collaboration between a human and an AI. Claude has seen orders of magnitude more JavaScript than OCaml. The result is fewer subtle bugs in the first draft, better API patterns, and more capacity focused on the math rather than the language. For a numeric library where a wrong sign in a gradient wastes a week of training, that matters.

Deno runs JavaScript natively. No compile step. No build system. deno run train.js and it runs. Deno's FFI calls C libraries directly, so we call OpenBLAS the same way we call it from OCaml, with less boilerplate:

const lib = Deno.dlopen("libopenblas.so", {
  cblas_dgemm: {
    // order, transA, transB, M, N, K, alpha,
    // A, lda, B, ldb, beta, C, ldc
    parameters: ["i32", "i32", "i32", "i32", "i32", "i32", "f64",
                 "buffer", "i32", "buffer", "i32", "f64", "buffer", "i32"],
    result: "void"
  }
});

One declaration. No C stub file. No marshalling code. The same BLAS library, the same matrix multiply, fewer lines between our code and the silicon.


Why Not TypeScript

We considered TypeScript. It is the obvious choice for a JavaScript project in 2026: better tooling, type safety, industry standard. But Vidya is not an industry project. It is 3,000 lines of numeric code that multiplies arrays of floats.

The types add nothing. A tensor is a Float64Array, a gradient is a Float64Array, a weight matrix is a Float64Array. Writing type annotations on every function that takes a float array and returns a float array is not safety; it is ceremony. The function names and variable names carry more information than the types.

// TypeScript
function matmul(
  a: { data: Float64Array; rows: number; cols: number },
  b: { data: Float64Array; rows: number; cols: number }
): { data: Float64Array; rows: number; cols: number }

// JavaScript
function matmul(a, b)

The TypeScript version is not safer. A transposed matrix has the same type as a non-transposed matrix. A wrong learning rate has the same type as a correct one. The bugs that kill training runs (wrong gradient formulas, off-by-one in head slicing, incorrect advantage normalization) are mathematical errors that no type system catches.

Types scale with complexity. In a web application with dozens of API endpoints, nested state, and five developers, TypeScript prevents real bugs. In a numeric library with one developer, flat data structures, and 20 functions that all do math on arrays, the type system is overhead. We would spend more time satisfying the type checker than it would ever save us in caught bugs.

Lean code, clear names. JavaScript forces you to write readable function names because the types are not there to document the interface. computeGradient, clipNorm, cosineLR: the names carry the meaning. If we need type hints for editor support, JSDoc gives us that without a type system:

/** @param {{ data: Float64Array, grad: Float64Array }} tensor */
function backward(tensor) { ... }

VS Code reads JSDoc and provides full autocomplete. No TypeScript required.


Why Deno

Deno is a minimal JavaScript runtime with built-in FFI, a built-in test runner, and no configuration files. It runs .js files directly. That philosophy โ€” small tools, no ceremony, everything built in โ€” matches how we build Vidya.

We are also porting Aither to Deno for its JACK audio FFI support. One runtime for all our projects means one set of patterns, one FFI interface, one test runner. When we eventually port Vidya to Tenstorrent Blackhole, the FFI call to TT-NN looks exactly like the FFI call to OpenBLAS: Deno.dlopen with a different library name.


Performance

The question everyone asks: is JavaScript fast enough for training a neural network?

The answer is: JavaScript does not train the neural network. BLAS does. Vidya's hot path is cblas_dgemm, a matrix multiply running in optimized C and Fortran inside OpenBLAS. Whether the caller is OCaml, JavaScript, or Python, the matrix multiply takes the same time.

Everything outside the BLAS call (softmax, activation functions, gradient accumulation, Adam optimizer updates) is array arithmetic in a loop. Deno's V8 engine JIT-compiles these loops to machine code. The result is not as fast as OCaml's ahead-of-time compilation, but the difference is small:

Operation              OCaml (native)   Deno (V8 JIT)   Share of training step
BLAS matmul            same             same            70-80% of time
Element-wise ops       ~1.0x            ~1.3x slower    10-15% of time
Adam updates           ~1.0x            ~1.2x slower    5-10% of time
Autograd bookkeeping   ~1.0x            ~1.5x slower    <5% of time

The non-BLAS code is roughly 20-30% of a training step. If that code runs 1.3x slower, the total training step is about 7-10% slower. On a 2-second step, that is 150 milliseconds. On a GPU step measured in milliseconds, it is invisible.
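To make the non-BLAS work concrete, here is a sketch of the kind of flat, element-wise loop V8 JIT-compiles well: one Adam update over Float64Arrays. The function name, signature, and in-place update style are illustrative, not Vidya's actual API.

```javascript
// One Adam step over flat parameter arrays: w is updated in place,
// m and v hold the running first and second moment estimates.
function adamStep(w, grad, m, v, t, lr = 1e-3, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  const c1 = 1 - Math.pow(b1, t); // bias correction for the first moment
  const c2 = 1 - Math.pow(b2, t); // bias correction for the second moment
  for (let i = 0; i < w.length; i++) {
    m[i] = b1 * m[i] + (1 - b1) * grad[i];
    v[i] = b2 * v[i] + (1 - b2) * grad[i] * grad[i];
    w[i] -= lr * (m[i] / c1) / (Math.sqrt(v[i] / c2) + eps);
  }
}
```

A monomorphic loop like this, always called with Float64Arrays, is exactly the case where the JIT gets close to ahead-of-time code.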

If a specific loop ever becomes a bottleneck (unlikely, but possible), Deno's FFI lets us drop that one function to C. Compile a .so, call it from JavaScript, done. The escape hatch exists. We do not expect to need it.


What We Test This Time

The OCaml codebase has no tests. That was fine for an experiment. It is not fine for a model we intend to keep for life.

The rewrite adds targeted tests for the things that fail silently and expensively. Not 100% coverage; that would be ceremony for a 3,000-line codebase. Just the tests that catch the bugs that waste weeks.

Gradient Checks

Every autograd operation gets a numerical gradient test. The method is simple: nudge each input by a tiny epsilon, measure how the output changes, compare that to the analytical gradient from the backward pass.

function numericalGrad(fn, input, eps = 1e-5) {
  // Central difference: perturb each element by +/- eps and measure
  // how the scalar output of fn changes. fn maps input to a number.
  const grad = new Float64Array(input.length);
  for (let i = 0; i < input.length; i++) {
    input[i] += eps;
    const plus = fn(input);
    input[i] -= 2 * eps;
    const minus = fn(input);
    input[i] += eps;  // restore the original value
    grad[i] = (plus - minus) / (2 * eps);
  }
  return grad;
}

One test per operation:

Operation               What it catches
matmul                  Transposition errors, wrong accumulation
softmax                 Numerical instability, wrong Jacobian
SiLU / SwiGLU gate      Wrong derivative formula
RMSNorm                 Incorrect variance computation
RoPE                    Wrong rotation direction, off-by-one in frequency
Cross-entropy loss      Sign errors in the gradient
GQA head broadcasting   Wrong head grouping in backward pass

These are the bugs that produce a model that trains to garbage over three days without any error message. A gradient check catches them in milliseconds.
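A minimal gradient check looks like this: compute the SiLU derivative analytically, estimate it with a central difference, and compare. This is an illustrative sketch, not Vidya's actual test file.

```javascript
// SiLU and its analytic derivative:
// d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
const sigmoid = (x) => 1 / (1 + Math.exp(-x));
const silu = (x) => x * sigmoid(x);
const siluGrad = (x) => sigmoid(x) * (1 + x * (1 - sigmoid(x)));

// Compare the analytic derivative against a central-difference estimate.
function checkSiluGrad(x, eps = 1e-5, tol = 1e-6) {
  const numerical = (silu(x + eps) - silu(x - eps)) / (2 * eps);
  return Math.abs(numerical - siluGrad(x)) < tol;
}
```

If someone mistypes the derivative formula, the check fails immediately instead of three days into a training run.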

Tokenizer Round-Trip

Encode a string to tokens, decode back to string, verify they match. Test edge cases: empty string, single character, unknown bytes, multi-byte UTF-8. A tokenizer bug produces silent data corruption: the model trains on garbage tokens and the loss curve looks normal.
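The shape of the round-trip test, sketched here with a byte-level stand-in for the real BPE pair (tokens are raw UTF-8 bytes; the actual encode/decode with merges is tested the same way):

```javascript
// Byte-level stand-in: encode a string to UTF-8 byte "tokens" and back.
const encode = (s) => Array.from(new TextEncoder().encode(s));
const decode = (tokens) => new TextDecoder().decode(new Uint8Array(tokens));

// Edge cases: empty string, single character, multi-byte UTF-8.
const cases = ["", "a", "hello world", "ñandú 🦙"];
for (const s of cases) {
  if (decode(encode(s)) !== s) throw new Error(`round-trip failed: ${s}`);
}
```

TextEncoder and TextDecoder are built into Deno, so the test needs no dependencies.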

Forward Pass Smoke Test

Run a known input through the full model. Verify the output shape is correct, all values are finite (no NaN, no infinity), and the softmax sums to 1. One test, ten lines. Catches wiring errors when connecting layers.
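The invariants that smoke test asserts can be sketched on the final softmax alone; the model wiring around it is omitted here, so this is a shape of the test, not the test itself.

```javascript
// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Smoke-test invariants: all outputs finite, probabilities sum to 1.
const logits = Array.from({ length: 32 }, () => Math.random() * 10 - 5);
const probs = softmax(logits);
if (!probs.every(Number.isFinite)) throw new Error("non-finite output");
if (Math.abs(probs.reduce((a, b) => a + b, 0) - 1) > 1e-9) {
  throw new Error("softmax does not sum to 1");
}
```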

What We Do Not Test

The training loop, checkpoint saving, file I/O, CLI arguments, logging. These break loudly and obviously: a crash, a missing file, a wrong number on screen. Silent failures are the enemy. Loud failures fix themselves.

About 15 tests total. Each under 20 lines. Run in under a second with deno test. The entire test suite is smaller than a single transformer layer, and it protects months of training time.


The Rewrite Plan

The rewrite follows the dependency order of the framework:

1. tensor.js      Autograd engine: all ops, forward + backward
                   (test: gradient checks for every op)

2. blas.js        Deno FFI to OpenBLAS cblas_dgemm
                   (test: multiply two known matrices, verify result)

3. bpe.js         BPE tokenizer: train, encode, decode
                   (test: round-trip encode/decode)

4. model.js       v2 model definition: 30 layers, SwiGLU, GQA
                   (no test needed: struct definitions)

5. forward.js     Training and inference forward passes
                   (test: smoke test for finite outputs, correct shapes)

6. train.js       Adam optimizer, cosine LR, gradient clipping
                   (test: one Adam step reduces loss)

7. generate.js    Text generation: sampling, chat, prompted
                   (no test needed: output is subjective)

8. main.js        CLI entry point
                   (no test needed: wiring only)

Each file is written and tested before moving to the next. The autograd engine comes first because everything depends on it. The model definition uses the v2 architecture from the start โ€” SwiGLU, GQA, 30 layers, 32K vocabulary, 2,048-token context. There is no intermediate step of reimplementing v1 in JavaScript. We go straight to v2.
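The cosine LR schedule in step 6 can be sketched in a few lines. This is a plain cosine decay from a maximum to a minimum learning rate; the function signature and the specific rates are illustrative, not Vidya's actual hyperparameters.

```javascript
// Cosine decay from lrMax at step 0 to lrMin at totalSteps.
function cosineLR(step, totalSteps, lrMax = 3e-4, lrMin = 3e-5) {
  const progress = Math.min(step / totalSteps, 1);
  return lrMin + 0.5 * (lrMax - lrMin) * (1 + Math.cos(Math.PI * progress));
}
```

The matching test in train.js only has to assert the endpoints and that the rate decreases monotonically in between.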


What Changes, What Stays

Changes:

- The language: OCaml to JavaScript on Deno.
- The model: v1 to v2 (SwiGLU, GQA, 30 layers, 32,000-token vocabulary, 2,048-token context).
- The tests: none to about 15 targeted tests.

Stays:

- The BLAS backend: OpenBLAS, the same cblas_dgemm, called over FFI.
- The framework concepts: autograd, BPE tokenizer, Adam optimizer, cosine LR.
- The principle: no frameworks, every layer built from first principles.

The framework is a vehicle. The model is the destination. We are switching vehicles before the road gets steep.


See also: Designing v2, A Model for Life, Burn the Stack.

Co-authored with Claude.