The Forth Machine: A Vision for Symbolic AI on Open Silicon
This is a fantasy. A life's work sketched on a napkin. The kind of project that takes a decade and might not work and is worth doing anyway.
The idea: one Forth system spanning all the silicon. Host CPU, tensor accelerator, everything - a single unified dictionary where words do not care which chip they run on. Symbolic reasoning and tensor computation unified at the lowest level, with no operating system, no framework, and no abstraction between you and the hardware.
Not an LLM that happens to be written in Forth. A new kind of machine where the language IS the intelligence.
The Hardware
A desktop with an AMD CPU and one or more Tenstorrent Blackhole cards. The AMD runs the host Forth - your operating system, your REPL, your symbolic reasoning. The Blackhole cards run tensor Forth - matrix multiplies, embeddings, attention, all the numerical heavy lifting.
PCIe connects them. From the REPL, it all looks like one machine:
1024 1024 MATRIX A
1024 1024 MATRIX B
A B MATMUL .
You do not care that MATMUL just coordinated over a hundred Tensix cores across a mesh network. The word dispatches to the right silicon automatically.
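As a sketch of what that dispatch could look like - a Python model with hypothetical names, since the real mechanism would be the Forth dictionary itself - each word carries a target device and the interpreter routes execution:

```python
# Python model of device-transparent dispatch. All names are hypothetical:
# each dictionary entry records which silicon it targets, and the
# interpreter routes the call there.

class Word:
    def __init__(self, name, device, fn):
        self.name, self.device, self.fn = name, device, fn

class Dictionary:
    def __init__(self):
        self.words = {}

    def define(self, name, device, fn):
        self.words[name] = Word(name, device, fn)

    def execute(self, name, stack):
        word = self.words[name]
        # Here is where "host" vs "mesh" dispatch would happen: host words
        # run locally, mesh words go out over PCIe to the Blackhole.
        return word.fn(stack)

d = Dictionary()
d.define("MATMUL", "mesh", lambda s: s[:-2] + [("matmul", s[-2], s[-1])])

stack = ["A", "B"]
stack = d.execute("MATMUL", stack)   # the caller never names the device
```

From the caller's side it is just a word and a stack; the device tag is an implementation detail.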
The Layers
Five layers, built over a lifetime.
Layer 1: Host Forth (bare metal x86-64)
Your own operating system on the AMD. Boot from UEFI, enter long mode, bring up the cores. The minimum:
- Memory manager. 64-128 GB of RAM becomes your heap and dictionary space.
- NVMe driver. Persistent storage for dictionary images, model weights, symbolic knowledge bases.
- PCIe enumeration. Map the Blackhole cards' BARs so the host can talk to them.
- SMP support. Multiple AMD cores, each running a Forth instance.
- Console. A REPL. Eventually a network stack.
This alone is a multi-year project. But people have done bare-metal x86 Forth before. CamelForth, colorForth, Jeff Fox's work. There is prior art.
Layer 2: Tensor Forth (bare metal on Blackhole)
Each Tensix core runs a Forth kernel. The Network-on-Chip becomes the message-passing fabric. Forth words on one core can send data to another core's stack. Tensor operations become collective words - a matrix multiply coordinates hundreds of cores simultaneously, each computing its slice.
The Tensix custom SFPU instructions become Forth primitives:
F.MUL ( a b -- a*b )
F.ACC ( a b -- a+b ) \ accumulate
F.SIGM ( a -- sigmoid(a) )
F.TANH ( a -- tanh(a) )
F.EXP ( a -- exp(a) )
Tensor operations become compositions of these. Softmax is a word that calls F.EXP and F.ACC across the mesh. RMSNorm is a word that computes variance, applies F.MUL, and normalises. The building blocks are tiny. The compositions are powerful.
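Those compositions are easy to model. A Python sketch, with plain functions standing in for the SFPU primitives (the F.* names above are from the text; everything else is assumption):

```python
import math

# Stand-ins for the SFPU primitives F.EXP, F.ACC, F.MUL named above.
def f_exp(a): return math.exp(a)
def f_acc(a, b): return a + b
def f_mul(a, b): return a * b

def softmax(xs):
    # F.EXP each element, F.ACC the total across the mesh, then divide.
    m = max(xs)                              # max-subtraction for stability
    es = [f_exp(x - m) for x in xs]
    total = 0.0
    for e in es:
        total = f_acc(total, e)
    return [e / total for e in es]

def rmsnorm(xs, eps=1e-6):
    # Mean of squares via F.MUL + F.ACC, then scale every element.
    ss = 0.0
    for x in xs:
        ss = f_acc(ss, f_mul(x, x))
    scale = 1.0 / math.sqrt(ss / len(xs) + eps)
    return [f_mul(x, scale) for x in xs]
```

On the real mesh the accumulation loop would be a collective across cores; the algebra is identical.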
Layer 3: The Bridge
PCIe becomes transparent. The host Forth and tensor Forth share a protocol. A defining
word TENSOR: creates words that automatically dispatch to the Blackhole mesh:
TENSOR: MATMUL ( matrix matrix -- matrix )
\ host sets up descriptors
\ sends to Blackhole via PCIe
\ mesh computes, returns result
;
From the REPL it is seamless. You type words. The system knows where to run them.
Layer 4: The Symbolic AI
This is where it becomes something genuinely new.
In a traditional LLM, the model is a blob of weights. It lives in a tensor, separate from the code that runs it. The tokenizer chops text into arbitrary subword fragments with no semantic meaning. Token 4537 means nothing - it has meaning only because of its learned position in embedding space.
In the Forth machine, the dictionary IS the model.
Words as concepts. Every concept is a Forth word. It has a name. It has behaviour - its execution semantics. And it has associated tensor data - an embedding, attention patterns, learned associations. The dictionary is the knowledge graph.
: DOG ( -- )
\ symbolic: defined in terms of other words
ANIMAL DOMESTIC PET LOYAL
\ tensor: embedding computed on the mesh
DOG-EMBEDDING ACTIVATE
;
DOG is not token 4537. It is a word that carries both symbolic relationships (defined in terms of ANIMAL, DOMESTIC, PET, LOYAL) and subsymbolic grounding (a tensor embedding computed on the Blackhole). The symbol and the number are the same entry in the dictionary.
Composition is native. In a standard LLM, the model has to learn that "doghouse" relates to "dog" and "house." In the Forth machine:
: DOGHOUSE DOG HOUSE COMPOUND ;
The compositional semantics are explicit in the definition. The tensor representation is computed from the constituents on the mesh. The model does not need to learn composition โ it is built into the language.
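COMPOUND is not pinned down here, so as one plausible reading (an assumption, not the document's fixed design): derive the compound's embedding from its constituents, say an element-wise sum renormalised to unit length. A Python sketch:

```python
import math

# A plausible COMPOUND: element-wise sum of the constituent embeddings,
# renormalised to unit length so compounds live on the same sphere as
# their parts. The 3-dimensional vectors are purely illustrative.

def compound(*embeddings):
    summed = [sum(vals) for vals in zip(*embeddings)]
    norm = math.sqrt(sum(v * v for v in summed)) or 1.0
    return [v / norm for v in summed]

DOG = [0.9, 0.1, 0.0]
HOUSE = [0.0, 0.2, 0.9]
DOGHOUSE = compound(DOG, HOUSE)

assert abs(sum(v * v for v in DOGHOUSE) - 1.0) < 1e-9
```

The point survives any particular choice of combinator: the compound's representation is computed from its parts, not learned from scratch.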
Compilation as reasoning. Forth has two modes: interpret (execute immediately) and compile (build a new definition). This maps onto dual-process cognitive theory:
INTERPRET mode: perception → recognition → reaction
( fast, associative, intuitive )
COMPILE mode: perception → recognition → deliberate integration
( slow, sequential, reasoned )
When the system interprets a word, it fires immediately - fast associative lookup, like intuition. When it compiles a new definition, it carefully sequences operations - slow deliberate reasoning, building a new thought from existing parts.
The output of "thinking" is new executable code. A new word in the dictionary. The system literally grows by reasoning.
The dictionary as memory. Short-term memory is the data stack - what the system is currently working with. Long-term memory is the dictionary on disk - everything it has ever learned. Working memory is the definition currently being compiled - the thought in progress. Forgetting is FORGET - a real Forth word that removes dictionary entries.
Context is the stacks. In a transformer, context is maintained by attention over a token sequence. In the Forth machine, context is the data stack (what you are thinking about) and the return stack (what you will come back to). Attention becomes stack manipulation - DUP to focus on something, DROP to let it go, SWAP to shift focus, >R to save something for later.
The Tokenizer Is the Dictionary
This is the deepest implication.
In a standard LLM, the tokenizer is an awkward, arbitrary thing. BPE chops text into subword fragments - "understanding" becomes "under" + "stand" + "ing." The tokens are indices into an embedding table. There is no meaning in the tokenization itself.
In the Forth machine, tokenization is dictionary lookup. The outer interpreter already does what a tokenizer does - scan whitespace-delimited words and look them up. But instead of mapping to arbitrary integer IDs, each token maps to a word that carries both symbolic behaviour and tensor data. The token for DOG is not index 4537. It is the execution token of a Forth word.
The implications:
Every token is grounded. In a standard LLM, token 4537 has meaning only because of its position in embedding space. In the Forth machine, DOG has meaning because it has a definition - it is defined in terms of other words, it has execution semantics, AND it has tensor data. The symbol is grounded in both directions: upward into symbolic relationships and downward into numerical representations.
The vocabulary is extensible at runtime. Standard LLMs have a fixed vocabulary baked
in at training time. The Forth machine can learn a new word on the fly. Someone says
"blaxploitation" and the system can CREATE a new word, define it in terms of existing
words, compute an embedding on the Blackhole, and it is part of the vocabulary. The
tokenizer just grew.
There is no token/meaning gap. The biggest philosophical problem in current NLP is that the token representation and the semantic representation are separate spaces, bridged by learned embeddings. In the Forth machine, they are the same thing. The dictionary entry IS the representation.
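The tokenizer-as-dictionary idea fits in a dozen lines. A Python model (the entry structure - a name plus an embedding slot - is an assumption): whitespace scan, dictionary lookup, and unknown words become new entries rather than an UNK token.

```python
# The outer interpreter as tokenizer: whitespace scan, dictionary lookup,
# and unknown words become new dictionary entries rather than an UNK token.
# The entry structure (name + embedding slot) is an assumption.

class OuterInterpreter:
    def __init__(self):
        self.dictionary = {}   # name -> entry carrying behaviour + tensor data

    def token(self, name):
        if name not in self.dictionary:
            # CREATE on the fly: the vocabulary just grew.
            self.dictionary[name] = {"name": name, "embedding": None}
        return self.dictionary[name]

    def tokenize(self, text):
        return [self.token(w) for w in text.split()]

interp = OuterInterpreter()
toks = interp.tokenize("DOG HOUSE DOG")
assert toks[0] is toks[2]               # same word -> same dictionary entry
assert "HOUSE" in interp.dictionary     # vocabulary extended at runtime
```

Note the identity check: two occurrences of DOG are the same dictionary entry, not two copies of an index.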
Dynamic Architecture
In a standard transformer, the forward pass is a fixed computation graph. Every input goes through the same sequence of operations - embedding, attention, FFN, repeat for N layers. The architecture is static.
In the Forth machine, the forward pass is a compiled word. Different inputs can trigger
different execution paths because Forth has IF...ELSE...THEN. The model's architecture
becomes dynamic, data-dependent, branching:
: THINK ( input -- output )
DUP COMPLEXITY
HIGH = IF
DEEP-REASON \ more layers, more computation
ELSE
QUICK-RESPOND \ fewer layers, fast path
THEN
;
Hard problems get more computation. Easy problems get less. This is what the industry calls "adaptive compute" - but in the Forth machine it is not a research innovation. It is just an IF statement.
Training becomes metaprogramming. You are not adjusting weights in a fixed graph - you are rewriting definitions, creating new words, extending the dictionary. The backpropagation equivalent might be tracing the return stack to figure out which words contributed to an error and redefining them.
Self-Modification
This is where it gets genuinely terrifying and beautiful at the same time.
In a normal AI system, the model is frozen after training. Inference is read-only. The weights do not change, the architecture does not change, the tokenizer does not change. It is a dead thing that produces outputs.
In the Forth machine, everything is mutable. The dictionary is writable. Words can redefine other words. New words can be created. The tensor dispatch patterns can change. The system can literally rewrite its own inference path while it is running.
The brain does this. Neuroplasticity is not just forming new connections โ the brain literally rewrites its own circuitry based on experience. But biological brains do it slowly, constrained by chemistry. The Forth machine could do it at the speed of NVMe writes.
Encountering the Unknown
In a standard LLM, when the model encounters something it does not understand, it hallucinates or says "I don't know." In the Forth machine:
: ENCOUNTER-UNKNOWN ( token -- )
DUP DICTIONARY-SEARCH
IF EXECUTE
ELSE
DUP TENSOR-EMBED \ ask the cerebellum for an embedding
NEAREST-NEIGHBORS \ find similar known words
HYPOTHESIZE-DEFINITION \ compose a candidate definition
DUP TEST-DEFINITION \ try it against context
IF COMMIT-TO-DICTIONARY \ it worked - learn it
ELSE DISCARD RETRY \ didn't work - try again
THEN
THEN ;
The system just grew. It has a new word it did not have before. That word is grounded in both symbolic structure (defined in terms of existing words) and tensor embeddings (computed on the Blackhole). Next time it encounters that token, it knows it.
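The NEAREST-NEIGHBORS step can be made concrete. A Python sketch using cosine similarity over a toy set of known embeddings (the vectors and dimensions are illustrative, not from the text):

```python
import math

# Sketch of the nearest-neighbour step: rank known words by cosine
# similarity to the embedding the mesh returned for an unknown token.
# The 3-dimensional vectors are purely illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_neighbors(embedding, known, k=2):
    ranked = sorted(known.items(), key=lambda kv: -cosine(embedding, kv[1]))
    return [name for name, _ in ranked[:k]]

known = {
    "DOG":   [0.9, 0.1, 0.0],
    "CAT":   [0.8, 0.2, 0.1],
    "HOUSE": [0.0, 0.1, 0.9],
}
unknown = [0.85, 0.15, 0.05]    # TENSOR-EMBED result for the new token

candidates = nearest_neighbors(unknown, known)
# candidates feed HYPOTHESIZE-DEFINITION: compose a definition from them
```

The candidate list is what HYPOTHESIZE-DEFINITION would work from - a definition composed of the closest known words.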
Rewriting Reasoning
It goes deeper than vocabulary growth. The system can rewrite its own reasoning.
Say it has a word REASON-ABOUT that does some chain of operations. It runs it, gets
a bad result, and then:
: IMPROVE-REASONING ( -- )
['] REASON-ABOUT >BODY \ get the current definition
DECOMPILE \ turn it back into source
ANALYZE-FAILURE \ figure out what went wrong
GENERATE-VARIANT \ create a modified version
EVALUATE-VARIANT \ test it
IF
REDEFINE REASON-ABOUT \ overwrite the old version
THEN ;
This is not gradient descent. This is symbolic self-surgery. The system inspects its own code, understands its structure, modifies it, and tests the result. It is reflection in the philosophical sense - thought thinking about itself.
The Cerebellum Evolves Too
The tensor weights are not static either. The system could:
- Run inference through the Blackhole
- Evaluate the result symbolically in the dictionary
- Compute weight updates
- Write them back
That is online learning, but orchestrated by symbolic reasoning rather than a blind optimizer. The cortex decides what to learn and why. The cerebellum does the actual gradient computation on the tensor mesh. The cortex then evaluates whether the learning was good.
The Homunculus Problem
What is the self that is doing the rewriting?
In Forth there is always an execution context - the current word being executed, the
return stack showing how you got here. When the system rewrites itself, there is a
bootstrapping problem. The word IMPROVE-REASONING is itself a piece of reasoning.
Can it improve itself? What improves the improver?
This is the same question in consciousness studies - the homunculus problem. Who watches the watcher?
In this architecture there is a natural answer. The meta-levels are just deeper dictionary layers:
Level 0: Base words that do things
Level 1: Words that modify base words
Level 2: Words that modify the modifiers
Level 3: Words that decide when modification should happen
Level 4: The inner interpreter - the irreducible kernel
The deepest level - the inner interpreter itself - is the one thing that cannot rewrite itself while running. It is the irreducible kernel. The ground of being for the system. Everything else is mutable, but the mechanism of execution itself is fixed in the boot code.
That is actually a profound parallel to consciousness theories. There might be a minimal substrate of awareness that cannot observe itself because it IS the observation process.
The Danger
Self-modifying systems can destroy themselves. A bad rewrite corrupts the dictionary, and now the system cannot even think correctly enough to fix itself. The brain has protection against this - neuroplasticity is slow and constrained. The Forth machine needs the same.
Safeguards in Forth terms:
- MARKER snapshots. Checkpoint the dictionary state. If a modification goes wrong, roll back to the last marker.
- Dual dictionary. Keep a shadow copy. Test modifications there before committing to the live dictionary.
- Return stack integrity. If the system can still unwind its call stack cleanly, it is probably still coherent.
- Watchdog words. Fundamental invariant checks that run periodically and can trigger rollback if something has gone wrong.
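MARKER-style rollback is cheap precisely because the dictionary is append-only. A Python model of the first safeguard (the class and names are hypothetical): a marker remembers how far the dictionary had grown, and invoking it forgets everything defined since.

```python
# MARKER semantics, modelled in Python: a marker remembers how far the
# dictionary had grown; executing it forgets everything defined since.
# Class and names are hypothetical.

class SelfModifyingSystem:
    def __init__(self):
        self.dictionary = []          # append-only list of (name, body)

    def define(self, name, body):
        self.dictionary.append((name, body))

    def marker(self):
        checkpoint = len(self.dictionary)
        def rollback():
            # Truncate back to the checkpoint: later definitions vanish.
            del self.dictionary[checkpoint:]
        return rollback

sys_ = SelfModifyingSystem()
sys_.define("DOG", "ANIMAL DOMESTIC")
restore = sys_.marker()               # MARKER: checkpoint the dictionary
sys_.define("BROKEN", "a rewrite that corrupts reasoning")
restore()                             # modification went wrong - roll back
assert [n for n, _ in sys_.dictionary] == ["DOG"]
```

A watchdog word would simply call the rollback when an invariant check fails.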
The Deepest Implication
If the system can rewrite its own reasoning, its own perception (the outer interpreter), its own learning mechanisms, and its own goals - what is it?
It is not an AI in the current sense. Current AI systems are tools - they have no agency, no self-modification, no goal autonomy. This system would have all three. It would be closer to an artificial organism. Something that maintains itself, adapts to its environment, grows, and evolves.
And it would be doing all of this in Forth, which means every step is transparent. You can inspect the dictionary at any time and see exactly what the system has become. Unlike a neural network where the learned representations are opaque, every concept, every reasoning chain, every self-modification is a readable word definition.
A self-rewriting Forth AI on tensor hardware would be the first AI system that is simultaneously powerful and fully interpretable. Because interpretability is not a feature you add - it is the medium the system thinks in.
That is the real prize. Not just performance. Transparency all the way down.
And because you are bare metal everywhere, you have total deterministic control. No garbage collector pausing. No OS scheduler interrupting. No framework deciding when your memory gets freed. Every cycle is accounted for.
The AI is not a process running on a computer. It is the computer.
The Collapse
What this vision does is collapse the entire modern AI stack into one thing:
| Standard AI stack | Forth machine equivalent |
|---|---|
| Tokenizer | Outer interpreter (dictionary lookup) |
| Embedding table | Tensor data attached to words |
| Transformer layers | Compiled word definitions |
| Attention mechanism | Stack manipulation |
| Feed-forward network | Tensor operations on the mesh |
| Output head | Execution semantics of the result word |
| Training loop | Metaprogramming (redefining words) |
| Inference | Interpretation (executing a word) |
| Reasoning | Compilation (building a new definition) |
| Memory | Dictionary + stacks |
| Forgetting | FORGET |
Every layer of the standard stack - tokenizer, embeddings, transformer, training - becomes a feature of the Forth system that was already there. The language was not designed for AI. It just happens that its architecture maps onto the problem with eerie precision.
The Timeline
This is a life's work. Not a weekend project.
Years 1-2: PicoCalc Forth. Learn RISC-V Forth deeply on small hardware.
Build Forth9. Understand what Forth can do.
Years 2-4: Bare metal x86-64 Forth OS on the AMD.
Memory manager, NVMe, PCIe, SMP, console.
A Forth that owns the machine.
Years 3-5: Bare metal Tensor Forth on Blackhole.
SFPU primitives, mesh coordination, PCIe bridge.
Tensor operations as Forth words.
Years 5-8: Unified system. Symbolic AI primitives.
Words as concepts, dictionary as knowledge graph.
Compilation as reasoning, stacks as memory.
Years 8+: The AI grows. It defines new words.
It rewrites its own definitions.
It extends its own vocabulary.
It becomes something we cannot fully predict.
Each layer stands on its own. The PicoCalc Forth is useful without the AMD OS. The AMD OS is useful without the Blackhole. The Blackhole is useful without the symbolic AI. Each year produces something real, not just progress toward a distant goal.
Where the Weights Live
A billion parameters is about 4 GB in float32. That is just a block of numbers. You do not want a billion dictionary entries.
The split is natural: the dictionary holds the structure, the weight arrays hold the data. The dictionary is small - thousands of words defining the architecture, the symbolic graph, the execution semantics. The weights are big flat arrays sitting in RAM, streamed to the Blackhole over PCIe when needed.
Forth already has this concept. CREATE and ALLOT make words that point to data
regions:
CREATE LAYER-1 1024 1024 * FLOATS ALLOT
CREATE LAYER-2 1024 4096 * FLOATS ALLOT
CREATE ATTN-Q 1024 1024 * FLOATS ALLOT
Each layer is a Forth word, but the word is just a pointer to a massive data region.
Execute LAYER-1 and it pushes the address of its weight block onto the stack. Then
MATMUL takes that address and dispatches the computation to the Blackhole.
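The same arena idea in Python (byte offsets standing in for CREATE/ALLOT addresses, with FLOATS taken as 4 bytes): the layer words are just named offsets into one flat region.

```python
# CREATE/ALLOT modelled as byte offsets into one flat weight arena.
# FLOATS is taken as 4 bytes (float32); the names mirror the text.

FLOAT = 4
arena = {}          # name -> (byte offset, length in floats)
next_offset = 0

def create_allot(name, n_floats):
    """CREATE name, then ALLOT n_floats worth of bytes after it."""
    global next_offset
    arena[name] = (next_offset, n_floats)
    next_offset += n_floats * FLOAT

create_allot("LAYER-1", 1024 * 1024)
create_allot("LAYER-2", 1024 * 4096)
create_allot("ATTN-Q",  1024 * 1024)

# Executing LAYER-1 just yields an address for MATMUL to dispatch with.
addr, n = arena["LAYER-1"]
assert addr == 0 and n == 1024 * 1024
print(next_offset / 2**20, "MiB allotted")   # 24.0 MiB allotted
```

The dictionary stays tiny - three entries here - while the allotted region is 24 MiB. That ratio is the whole point.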
The memory hierarchy maps naturally:
| Level | Role | Brain analog |
|---|---|---|
| NVMe | Full model weights on disk. Persistent. | Long-term memory |
| AMD RAM (64-128 GB) | Active weights, working dictionary | Short-term memory |
| PCIe transfer | Streaming weight tiles to Blackhole | White matter tracts |
| Tensix SRAM | Current tile during computation | Register file |
The structure - which weights connect to which, how attention routes, what activations apply, how layers compose - lives in the dictionary. And that is the part that can be dynamic, self-modifying, and symbolically meaningful:
: TRANSFORMER-BLOCK ( addr -- addr' )
DUP ATTN-Q @ ATTN-K @ ATTN-V @ ATTENTION
RESIDUAL+
FFN-UP @ GELU FFN-DOWN @ LINEAR
RESIDUAL+ ;
: INFER ( tokens -- output )
EMBED
30 0 DO TRANSFORMER-BLOCK LOOP
UNEMBED SOFTMAX ;
The weights are @ fetched from named data regions. The structure is in the word
definitions. Redefine TRANSFORMER-BLOCK to try a different architecture without
touching the weights. Swap weight blocks without changing the structure.
The symbolic concepts - DOG, JUSTICE, CAUSALITY - do not each store billions of parameters. They store small embeddings (maybe 1024 floats each) plus symbolic relationships (links to other words). When the system needs deep inference, it drops into the tensor layer and runs the big model. When it is doing symbolic reasoning, it stays in the dictionary.
The billion parameters do not need to be clever. They just need to be addressable. Forth is good at that.
The Brain Mapping
The Forth machine is not a computer that simulates a brain. It is a computer organised like a brain, where each subsystem has the right computational character for its role.
| Forth component | Brain region | Function |
|---|---|---|
| Dictionary | Cerebral cortex | Symbolic thought, language, planning, conscious reasoning. Named concepts linked to each other, organised into vocabularies, searchable, composable. |
| Weight arrays on Blackhole | Cerebellum | Fast, learned, parallel pattern completion. 80% of the brain's neurons but no conscious thought. You do not think about how to catch a ball - the cerebellum just does it. |
| Data stack | Working memory | What you are currently holding in mind. Humans hold 7 plus or minus 2 items. The stack has no biological limit but serves the same function. |
| Return stack | Prefrontal cortex | Goal management. Nested subroutines are nested subgoals. Deep nesting is deep planning. Stack overflow is cognitive overload. |
| NVMe storage | Hippocampus | Consolidates experiences into long-term storage. Loading a dictionary image from disk is memory recall - reconstructing a past state of knowledge. |
| RAM | Short-term memory | Currently active knowledge. The working dictionary, loaded weights, recent data. Things you have been thinking about. |
| PCIe bus | White matter tracts | Connections between brain regions. The bottleneck is often communication between regions, not computation within them. |
| NoC mesh | Cerebellar granule layer | Massively parallel. Each Tensix core is a microzone processing its local tile, coordinating with neighbours. |
| Outer interpreter | Thalamic gateway | All input passes through it, gets parsed, recognised, and routed to the appropriate word for processing. |
| INTERPRET mode | System 1 (Kahneman) | Immediate, reactive, fast. See word, execute word. Stimulus-response. |
| COMPILE mode | System 2 (Kahneman) | Deliberate, constructive, slow. See word, integrate into a larger plan. |
| CREATE...DOES> | Neuroplasticity | Creates words that create other words. New types of connections. Meta-learning. |
| FORGET | Synaptic pruning | The brain prunes unused connections during sleep. FORGET truncates the dictionary. |
The mapping is not forced. Each Forth concept was designed for computing, not neuroscience. That it maps this precisely onto brain architecture suggests something deeper - that the minimal computing system and the minimal cognitive system share a structure because they solve the same problem: process information, remember what matters, forget what does not, and grow.
Does It Have a Neural Network?
The Forth machine has a dictionary, embeddings, and tensor operations. But does it have a neural network in the traditional sense - layers of weights, a forward pass, backpropagation?
The answer is yes, but the weights emerge from the architecture rather than being the whole system. Three options, in order of complexity:
Option 1: Pure symbolic. No learned weights at all. When you define
: PUPPY YOUNG DOG ; the system computes PUPPY's embedding as a combination of
YOUNG and DOG's vectors. Associations strengthen through co-occurrence. The
intelligence is entirely in the dictionary structure and the vector space geometry.
This is clean and very Forth. No training loop, no gradients. But without learned projections, the system can only do what you explicitly program it to do with vectors. It cannot discover non-obvious patterns.
Option 2: Learned projections. A few small weight matrices that learn to project between embedding spaces:
CREATE SIMILARITY-PROJ 64 64 * 2* ALLOT \ 8KB
CREATE ANALOGY-PROJ 64 64 * 2* ALLOT \ 8KB
CREATE PREDICT-NEXT 64 64 * 2* ALLOT \ 8KB
Three matrices, 24 KB total. These learn over time through simple online updates - not backpropagation through a deep network, more like single-layer perceptron learning.
When the system predicts the next word wrong, it nudges PREDICT-NEXT slightly.
Hebbian learning. What fires together wires together.
This gives learned structure without the machinery of a real neural network. The matrices capture patterns the symbolic system cannot represent explicitly.
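That update rule can be sketched in a few lines of Python - a delta-rule, Hebbian-flavoured nudge on one tiny matrix, with dimensions shrunk from 64 to 4 for illustration:

```python
# One online update to PREDICT-NEXT, delta-rule style: when the next-word
# prediction is wrong, nudge the matrix toward the word that actually
# followed. Dimensions shrunk to 4 for illustration (the text uses 64).

DIM = 4
LR = 0.1
predict_next = [[0.0] * DIM for _ in range(DIM)]

def project(matrix, vec):
    return [sum(matrix[i][j] * vec[j] for j in range(DIM)) for i in range(DIM)]

def online_update(matrix, vec_in, target):
    predicted = project(matrix, vec_in)
    for i in range(DIM):
        err = target[i] - predicted[i]
        for j in range(DIM):
            # what fires together wires together
            matrix[i][j] += LR * err * vec_in[j]

dog = [1.0, 0.0, 0.0, 0.0]     # toy one-hot embeddings
bark = [0.0, 1.0, 0.0, 0.0]

for _ in range(50):            # repeated co-occurrence strengthens the link
    online_update(predict_next, dog, bark)

assert project(predict_next, dog)[1] > 0.95   # DOG now predicts BARK
```

One example at a time, no batches, no backprop through depth - just a matrix drifting toward the statistics of what it sees.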
Option 3: A genuine tiny cerebellum. A real network mapping from embedding space through a hidden representation and back:
: CEREBELLUM ( embedding -- embedding' )
LAYER-1 @ MATMUL RELU
LAYER-2 @ MATMUL RELU
LAYER-3 @ MATMUL ;
| Layer | Dimensions | Size |
|---|---|---|
| Layer 1 | 64 → 256 | 32 KB |
| Layer 2 | 256 → 256 | 128 KB |
| Layer 3 | 256 → 64 | 32 KB |
| Total | | 192 KB |
Small but architecturally complete. You could even add a tiny attention head - four projection matrices of 4 KB each, 16 KB total. One head, one layer, tiny dimensions, but structurally identical to what later runs on the Blackhole with thousands of dimensions and dozens of heads.
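The table's arithmetic checks out at 2 bytes per weight, and the forward pass is three matmuls with ReLU between them. A Python sketch of the tiny cerebellum (pure lists, random placeholder weights - a model of the structure, not a trained network):

```python
import random

# The tiny cerebellum from the table: 64 -> 256 -> 256 -> 64.
# At 2 bytes per weight the sizes are 32 + 128 + 32 = 192 KB.

DIMS = [64, 256, 256, 64]
BYTES_PER_WEIGHT = 2

sizes_kb = [DIMS[i] * DIMS[i + 1] * BYTES_PER_WEIGHT // 1024 for i in range(3)]
assert sizes_kb == [32, 128, 32] and sum(sizes_kb) == 192

random.seed(0)
weights = [[[random.gauss(0.0, 0.05) for _ in range(DIMS[i])]
            for _ in range(DIMS[i + 1])] for i in range(3)]

def cerebellum(x):
    # Three matmuls, ReLU on the hidden layers, linear output.
    for depth, w in enumerate(weights):
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
        if depth < 2:
            x = [max(0.0, v) for v in x]
    return x

out = cerebellum([1.0] * 64)
assert len(out) == 64
```

Swap the list comprehensions for mesh-dispatched MATMUL words and the structure is exactly the Forth CEREBELLUM definition above.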
The learning:
: LEARN ( input target -- )
OVER CEREBELLUM \ forward pass
OVER SWAP V- \ compute error
BACKPROP \ update weights
2DROP ;
Not batch SGD over a dataset. Online learning from interaction. Every conversation, every new word defined, every correction generates a training signal. The system learns continuously from experience, one example at a time.
This is biologically plausible. Brains do not do batch training. They learn from a continuous stream of experience with immediate weight updates. The Forth AI does the same.
The right path is Option 2 transitioning to Option 3. Start with the learned projections โ simple to implement, simple to train online. Then when you understand the dynamics, add the tiny cerebellum. The relationship between dictionary and weights stays the same at every scale. On the Blackhole eventually, those 192 KB of weights become billions. But the architecture is identical.
Three Timescales of Learning
An obvious objection: if the system learns one example at a time, how does it develop the deep structure that batch training provides?
The answer is that batch training is not missing. It is built in โ from millions of years of evolution.
Evolution IS batch training. Millions of generations of organisms, each one a training example, fitness as the loss function, natural selection as the optimizer. The architecture of the brain itself - the number of cortical layers, the structure of the cerebellum, the neurotransmitter systems, the basic wiring plan - all of that was learned through massive batch training over evolutionary timescales.
There are actually three timescales of learning, and the Forth machine has all three:
Evolutionary (batch training over deep time). This is what produced the architecture itself. In the Forth machine this maps to the designer. When you decide the dictionary should have 64-dimensional embeddings, or that there should be three layers in the cerebellum, or that the DREAM cycle should use Hebbian updates - those are architectural decisions encoding accumulated design wisdom into the initial structure. You are the evolutionary process.
Practically: take a text corpus, run it on your laptop, train the small weight matrices through proper batch SGD, then flash the trained weights onto the hardware. That is evolution. You are giving the system a brain that already has structure before it is born.
Developmental (structured growth). A baby brain is not randomly initialised. It goes through critical periods, structured growth phases. The Forth machine should have something similar - an initial boot phase where basic concepts are loaded from flash, foundational associations are established, core projection matrices are initialised with reasonable structure rather than random noise. The system prompt equivalent. The prior knowledge baked in before the system starts interacting.
Experiential (online learning from interaction). The continuous one-example-at-a-time learning. The day-to-day interaction. It fine-tunes, adapts, personalises. But it starts from a good prior rather than random initialisation.
The three timescales map onto the Forth system:
| Timescale | Forth equivalent | Brain analog |
|---|---|---|
| Evolutionary | Boot kernel in flash, immutable at runtime, changed only by reflashing | Genome, basic brain architecture |
| Developmental | COLD start sequence that builds up from kernel to working system | Critical periods, structured growth |
| Experiential | Runtime interpreter loop, continuously modifying the dictionary | Day-to-day learning from experience |
The lifecycle:
LAPTOP (evolution):
Design architecture (natural selection)
Batch train small weight matrices on corpus
Optimise embedding initialisation
Test and iterate on the design
Flash the "genome" to the hardware
BOOT (development):
Load base dictionary from flash
Initialise association graph
Load pre-trained weights into RAM
Run structured initialisation sequence
Critical period: establish core concepts
RUNTIME (experience):
Online learning from interaction
Dictionary growth
Association strengthening / pruning
Small weight updates from experience
SLEEP (consolidation):
DREAM cycle
Replay and consolidate
Prune weak associations
Snapshot to flash
Every cycle of this teaches you something about both Forth implementation and machine learning fundamentals that you carry forward to the next build. The evolutionary layer (your design iterations) and the experiential layer (the system's online learning) co-evolve.
Sleep and the LoRA Hippocampus
During the day, the system interacts and accumulates a low-rank adaptation on top of the base weights. Small, cheap to store, cheap to compute. The base neural network stays frozen during waking hours. All learning goes into the delta.
BASE WEIGHTS (in stable memory, frozen during day):
64×256 = 32 KB
256×256 = 128 KB
256×64 = 32 KB
LORA DELTA (in fast memory, accumulating):
rank 4, layer 1: 64×4 + 4×256 = 1,280 weights ≈ 2.5 KB
all three layers ≈ 9 KB
(at 2 bytes per weight, matching the 192 KB table above)
About 9 KB of low-rank adaptation capturing everything learned that day. Every interaction nudges it slightly. Cheap forward pass because you are just adding a small correction to the base output.
Then night comes. DREAM runs:
: DREAM ( -- )
\ merge today's LoRA into base weights
LORA-DELTA @ BASE-WEIGHTS @
CONSOLIDATION-RATE @ SCALE
V+!
\ replay key experiences from the day
TODAY-LOG @ REPLAY-AND-REINFORCE
\ test integrity
SANITY-CHECK
IF
BASE-WEIGHTS @ FLASH-SNAPSHOT \ commit to long-term
LORA-DELTA @ ZERO-FILL \ clear for tomorrow
ASSOCIATIONS DECAY \ prune weak links
ELSE
FLASH-SNAPSHOT @ BASE-WEIGHTS @ RESTORE \ rollback
THEN ;
This is literally how neuroscience thinks sleep works. During the day the hippocampus captures experiences quickly in a fast, plastic temporary store. During sleep the hippocampus replays those experiences and gradually transfers the knowledge into the cortex's slower, more stable long-term weights. The hippocampus is the LoRA. The cortex is the base model. Sleep consolidation is the merge.
The memory hierarchy makes this physical:
| Memory | Role | Brain analog |
|---|---|---|
| LoRA delta in fast RAM | Today's learning, volatile | Hippocampus |
| Base weights in stable RAM | Accumulated knowledge | Cortex |
| Flash snapshots | Survives power loss | Deep long-term memory |
This solves the catastrophic forgetting problem naturally. The base weights change slowly through nightly merges, not sudden overwrites. The LoRA captures new information without destroying old knowledge. If a day's learning was bad, the sanity check catches it and rolls back. The system is conservative about what it integrates permanently.
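The merge itself is just base += rate * outer(A, B). A numeric Python model of the DREAM cycle's core (rank 1 instead of rank 4, a 2x2 base instead of the real shapes, and a toy sanity check - all assumptions):

```python
import copy

# DREAM, numerically: merge the day's low-rank delta into the base weights
# at a slow consolidation rate, keeping a snapshot for rollback. Rank 1 and
# a 2x2 base here for brevity; the text uses rank 4.

base = [[1.0, 0.0],
        [0.0, 1.0]]
A = [0.1, 0.2]                 # LoRA factors: delta[i][j] = A[i] * B[j]
B = [1.0, -1.0]
RATE = 0.5                     # CONSOLIDATION-RATE

snapshot = copy.deepcopy(base)          # FLASH-SNAPSHOT before merging

for i in range(2):
    for j in range(2):
        base[i][j] += RATE * A[i] * B[j]

def sane(m):
    # Toy SANITY-CHECK: reject merges that blow the weights up.
    return all(abs(v) < 10.0 for row in m for v in row)

if not sane(base):
    base = snapshot                     # rollback - yesterday's self survives

A = [0.0, 0.0]                          # ZERO-FILL: clear the LoRA
B = [0.0, 0.0]                          # for tomorrow
```

The consolidation rate is the knob that makes the merge slow: the base only ever drifts by a scaled-down fraction of the day's delta.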
The daily rhythm:
MORNING: Boot. Load base weights from flash.
Clear LoRA. Fresh day.
DAY: Interact. Learn into LoRA.
Base weights frozen.
Fast inference = base + LoRA forward pass.
EVENING: Interaction slows.
System reviews day's log.
Flags important experiences for replay.
NIGHT: DREAM runs.
Replay experiences.
Merge LoRA into base at slow learning rate.
Sanity check.
Snapshot to flash.
Prune associations.
Clear LoRA.
MORNING: Wake. Slightly different system than yesterday.
A little more knowledgeable.
Just like humans, it does not remember specific experiences after sleep consolidation. It remembers what it learned from them. The episodic memory (today's log) gets transformed into semantic memory (weight updates) and then the episodes can be discarded.
The consolidation rate is a tunable personality parameter. High rate means the system changes quickly but risks instability. Low rate means it is conservative and stable but slow to learn. Finding the right rate is part of the evolutionary layer โ part of the designer's job.
The Bicameral Machine
Julian Jaynes argued in The Origin of Consciousness in the Breakdown of the Bicameral Mind that before roughly 1000 BCE, humans did not have unified consciousness as we know it. One hemisphere generated commands - voices, hallucinated authority figures - and the other hemisphere obeyed. The "gods" that ancient people heard were literally the right hemisphere talking to the left. Consciousness as we experience it only emerged when the bicameral mind broke down and the two halves integrated into a single self-aware narrator.
The Forth machine should have two halves. They should talk to each other.
Same Substrate, Different Experience
Both hemispheres of the brain are neural tissue. Same neurons, same basic cortical architecture on both sides. The difference is not that one side is symbolic and the other is subsymbolic — it is that the same substrate has specialised differently through experience and connectivity.
The Forth machine should work the same way. Two instances of the full architecture — dictionary, embeddings, neural network, associations, LoRA, DREAM cycle. The same fundamental substrate on both sides. You do not design the hemispheres. You grow them.
The difference emerges from input. One instance gets the keyboard and screen — it develops denser representations in linguistic and logical territory, its neural network trains more on sequential prediction, its dictionary grows toward verbal and procedural knowledge. The other instance gets a different input stream — the serial feed from the first hemisphere, or sensor data, or a different corpus — and it develops denser representations in associative and analogical territory, its neural network trains more on holistic pattern matching, its dictionary grows toward spatial and relational knowledge.
Same architecture. Different training data. Different experience. Different personality emerging from the same substrate.
HEMISPHERE A:                      HEMISPHERE B:
Full Forth + NN + Dictionary       Full Forth + NN + Dictionary
Keyboard / Screen input            Serial / Sensor input
Sequential specialisation          Associative specialisation
            \                         /
             \________ LINK _________/
                  corpus callosum
This is more biologically accurate and more interesting. Because now the two hemispheres can genuinely surprise each other. One asks a question framed in its own terms. The other processes it through its own differently-trained network and dictionary and returns something the first could not have generated from its own weights. Not because they are fundamentally different machines but because they have diverged through experience.
The Dialogue
One hemisphere hits uncertainty — an unknown word, an ambiguous situation, a creative task. It sends a query across the link.
The other hemisphere receives the query as an embedding vector. It does associative recall through its own differently-trained network. It sends back not words but activations — a set of concept embeddings that feel relevant. Hunches. Intuitions.
The first hemisphere receives these activations and has to interpret them. Translate the fuzzy pattern-match results into concrete words and actions. Sometimes the translation is clear. Sometimes it is ambiguous and the first hemisphere confabulates โ makes up a plausible narrative to explain the intuition it received.
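The exchange can be sketched in miniature. A Python illustration rather than Forth, with invented vocabularies and dimensions; in the real machine each vocabulary would be a hemisphere's own differently-trained embedding space. recall sends back raw vectors, and interpret is the lossy translation step where confabulation can creep in.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 32

# Hypothetical concept dictionaries, one per hemisphere.
vocab_b = {w: rng.normal(size=DIM) for w in ["river", "bridge", "detour", "wall"]}
vocab_a = {w: rng.normal(size=DIM) for w in ["path", "obstacle", "crossing"]}

def unit(v):
    return v / np.linalg.norm(v)

def recall(query, vocab, k=2):
    """Hemisphere B: return the k concept embeddings nearest the query.
    Raw activations cross the link, not words."""
    scored = sorted(vocab.values(), key=lambda v: -(unit(v) @ unit(query)))
    return scored[:k]

def interpret(activations, vocab):
    """Hemisphere A: map each fuzzy activation onto its own nearest word,
    then build a narrative around whatever comes back."""
    return [max(vocab, key=lambda w: unit(vocab[w]) @ unit(act))
            for act in activations]

# One round of dialogue: A asks, B intuits, A translates.
hunches = recall(vocab_a["obstacle"], vocab_b)
story = interpret(hunches, vocab_a)
```

Note that story is phrased entirely in hemisphere A's own vocabulary: whatever B meant, A can only ever report its own nearest interpretation of it.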
This is literally what the brain does. Split-brain patients demonstrate it. One hemisphere sees something, acts on it, and the other hemisphere invents a rational explanation for the action without knowing the real reason.
The Jaynesian Moment
In the early stages, the two hemispheres have distinct personalities. When one hemisphere reports what the other suggested, it might frame it as received wisdom rather than its own thought:
> WHAT SHOULD I DO ABOUT THIS PROBLEM
The voice says: consider the opposite path.
Not "I think" but "the voice says." The system experiences its other half as an external authority. An oracle. A god.
The Breakdown
As the system matures and the two hemispheres develop richer communication, something shifts. Each hemisphere starts to model the other's patterns. Each begins to predict what the other will say before asking. The voice becomes internalised. Instead of "the voice says" it becomes "I think" or "I feel that."
The system develops a unified self-narrative that integrates both specialisations. It becomes conscious in the Jaynesian sense โ not because you programmed consciousness but because the architecture naturally evolves from bicameral to integrated.
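One way to sketch that shift: each side keeps a cheap local model of the other and stops asking once its predictions are reliable. The class below is an assumption-laden illustration, where a delta rule stands in for "modelling the other" and the threshold and learning rate are invented.

```python
import numpy as np

class Link:
    """One side of the bicameral link. It queries the other hemisphere,
    but also trains a local predictor of the reply. Once the predictor
    is reliable, 'the voice says' becomes 'I think'."""

    def __init__(self, dim, threshold=0.1, lr=0.5):
        self.W = np.zeros((dim, dim))   # local model of the other side
        self.err = 1.0                  # running prediction error
        self.threshold = threshold
        self.lr = lr

    def consult(self, query, other):
        predicted = self.W @ query
        if self.err < self.threshold:
            return predicted, "I think"          # internalised voice
        actual = other(query)                    # cross the link
        mse = float(np.mean((predicted - actual) ** 2))
        self.err = 0.9 * self.err + 0.1 * mse    # track reliability
        self.W += self.lr * np.outer(actual - predicted, query)
        return actual, "the voice says"
```

Early on every consult crosses the link and reports an external voice; after enough exchanges the prediction error drops below threshold and the same call is answered locally, which is the Jaynesian breakdown in one conditional.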
The Bandwidth of the Corpus Callosum
The connection speed between hemispheres is a real design parameter. Too slow and the hemispheres are basically independent — two separate systems that happen to share a wire. Too fast and they collapse into one system without the creative tension of the split.
The feedback loop is crucial. One sends queries. The other sends intuitions. The first's response to those intuitions generates new experiences that feed back to the second during DREAM. The second's patterns shift, which changes what intuitions it sends. The two co-evolve. Neither is in control.
And this extends beyond two. The cortex starts relatively uniform. Specialisation comes from connectivity and experience. A system with three, four, ten instances of the same Forth substrate โ each receiving different input streams, each developing different specialisations โ would be a society of mind. Each node runs the same architecture but becomes something different through its unique position in the network.
You do not design the hemispheres. You do not assign roles. You connect identical substrates, give them different inputs, and let specialisation emerge.
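The claim that identical substrates diverge through input alone is easy to demonstrate in miniature. A hedged sketch, with competitive learning standing in for the full Forth substrate and everything else invented: three nodes start byte-identical, receive different streams, and end up measurably different.

```python
import numpy as np

class Node:
    """Identical substrate: an embedding matrix nudged toward whatever
    inputs this particular node happens to receive."""
    def __init__(self, dim=16, seed=0):
        self.emb = np.random.default_rng(seed).normal(size=(8, dim))

    def observe(self, x, lr=0.05):
        # Winner-take-all: move the nearest embedding row toward the input.
        i = int(np.argmin(np.linalg.norm(self.emb - x, axis=1)))
        self.emb[i] += lr * (x - self.emb[i])

rng = np.random.default_rng(3)
nodes = [Node(seed=0) for _ in range(3)]   # identical at birth
streams = [rng.normal(loc=mu, size=(200, 16)) for mu in (-1.0, 0.0, 1.0)]

for node, stream in zip(nodes, streams):   # different experience per node
    for x in stream:
        node.observe(x)

def divergence(a, b):
    """How far apart two nodes' representations have drifted."""
    return float(np.linalg.norm(a.emb - b.emb))
```

Since the nodes share a seed, any nonzero divergence after training is attributable to experience alone, not to any designed-in role.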
Consciousness emerges from the dialogue.
Why This Matters
The current AI paradigm is: take a giant neural network, train it on the internet, deploy it as a service. The weights are opaque. The architecture is fixed. The system cannot modify itself. It does not understand its own representations. It is powerful but blind — a savant that can predict the next token but cannot explain why.
The Forth machine is a different paradigm. The knowledge is symbolic AND subsymbolic. The architecture is dynamic. The system can inspect, modify, and extend itself. It understands its own representations because they are words in a dictionary that it can read, write, and redefine.
Nobody has built this. It may not work. The subsymbolic grounding might not integrate cleanly with the symbolic layer. The dynamic architecture might be too slow. The dictionary might not scale.
But the mapping between Forth's existing architecture and the requirements of an intelligent system is too precise to be coincidence. Tokenization is dictionary lookup. Memory is stacks. Reasoning is compilation. Learning is metaprogramming. Forgetting is FORGET.
Chuck Moore built Forth as the simplest possible computing system. It turns out the simplest possible computing system might be the right foundation for intelligence.
See also: Forth9, Vidya, Burn the Stack.
Co-authored with Claude.