Program 02 · Frankenstellm · Modular Sovereign Inference Runtime

A stitched intelligence. Built from operated-on models.

Not a single model. An assembled organism: one refined reasoning brain, a colony of small specialist organs, a structured memory spine, a verifier, and a loop that turns every failure into surgical training data.

Commercial translation — multi-model runtime + verifier + memory + fully local deployment. A single C++/CUDA binary. No cloud, no API, no telemetry.

Brain + Organs + Memory + Verifier · Black-Dog reinforcement
Brain: Q4 7B · 83.58 tok/s
Organs: 8 defined · 5 in production after BD9
Doctrine: organ-first, 7B fallback
Schematic
FRANKENLLM · gigachad_native ORGAN-FIRST · HUMAN-GATED

About

We do not build a bigger parrot. We assemble a body.

A single language model is powerful, but fragile. It has no stable organs, no bloodstream, no memory of its own failures, no nervous system that tells it when it has lied. Frankenstellm is our answer to that limitation.

At the top sits a refined reasoning model — the brain. Below it sits a colony of compact sub-billion-parameter specialists: organs for code skeletons, JSON repair, claim extraction, contradiction analysis, rendering, critique, cache matching, and wound repair. Around them sits a native runtime that routes tasks, executes checks, writes traces, and records what happened.

Every output passes through a verifier. Every failure becomes poison. Every success becomes food. The Black-Dog reinforcement loop updates routing conductance so the system does not repeat the same mistake forever.

This is not a prompt chain. It is a stitched runtime organism.

The organism is not finished. Its advantage is not that every organ is already strong. Its advantage is that weak organs can be found, measured, wounded, operated on, repacked, and improved.

Results frontier APIs cannot replicate

Five measured wins, not five claims.

A
Parity

HumanEval pass@1 vs PARROT (same-weights Q4 7B): 70 % / 70 % (14 / 20 each).

Single-shot, same model class, same hardware. The runtime hits the ceiling that the underlying weights allow.

source: SOVEREIGN_WIN_REPORT_V2.md · axis A
B
Repeat-learning

MBPP round 2 (same 20 problems): MONSTER 13 / 20 vs PARROT 12 / 20.

In round 1, PARROT was ahead. In round 2, MONSTER overtook it because the organism wrote its own scroll between rounds. PARROT cannot do that.

source: MBPP_REPEAT_LEARNING_V1.md
C
Hologram cache

Identical-prompt second call: 860 ms → 1 ms = 860× speedup.

The second call to the exact same prompt is essentially free. An API charges full price every call, every time.

source: EXACT_REPLAY_CACHE_V1.md
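
The exact-replay idea can be sketched in a few lines. This is a minimal illustration and not the runtime's cache: the class name `ExactReplayCache`, the SHA-256 key, and the stand-in `fake_model` are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


class ExactReplayCache:
    """Exact-match cache: byte-identical prompt -> stored answer (hypothetical sketch)."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate            # slow path, e.g. a model call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Hash the exact prompt bytes; any change in the prompt is a miss.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def ask(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._generate(prompt)
        self._store[key] = answer
        return answer


model_calls = []

def fake_model(prompt: str) -> str:
    """Stand-in for the expensive generation path; records each call."""
    model_calls.append(prompt)
    return prompt.upper()

cache = ExactReplayCache(fake_model)
first = cache.ask("reverse a list")
second = cache.ask("reverse a list")   # served from the cache, no model call
```

The second call never reaches `fake_model`, which is the property the 860 ms → 1 ms measurement relies on. A near-identical prompt still takes the slow path.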
D
Terminal capsules

Terminal-NanoOS-30 tasks: MONSTER 22 / 30 vs PARROT 20 / 30 (+2).

Capsule-isolated shell tasks: the runtime carries verifier and retry context that an API call doesn't get to keep.

source: TERMINAL_NANOOS_30.md
E
Acceptance integrity

Internal acceptance bench: 18 / 18 · identity 14 / 14 · leaks 0.

Architecture audit 10/10. Reproducible decode, deterministic per pack+prompt, no organ leaks across routes.

source: gigachad_acceptance_run_v14_llamacpp.json

These are the architectural wins. Parity on the canonical bench, plus three properties a hosted API cannot reproduce by design — and a clean integrity gate. The organism's value is not raw single-shot scores; it is what it accumulates between calls.

The Stack

Seven systems. One body.

01 · THE BRAIN · TOP-LEVEL REASONING

A refined top-level reasoning model.

The brain is the high-capacity model that handles synthesis, judgment, and difficult reasoning. It is not expected to do everything. It receives work from the lower organs and is called last, not first, when the smaller specialists cannot solve the task alone.

Brain-first routing is the failure mode. If the 7B model handles every task by default, the organism is just an expensive wrapper. The brain exists to rescue, synthesize, and decide — not to answer every question before an organ gets a chance.

  • Role: Top-level synthesis, judgment, rescue when organs fail.
  • Donor line: Qwen / DeepSeek-class open models. Auditable weights, deployable locally.
  • Policy: 7B last, organs first. Calling the brain is the fallback, not the default.
Model class: 7B · Calls: last resort · Policy: organ-first

02 · THE ORGANS · SPECIALIST COLONY

A colony of small specialist models.

Frankenstellm uses compact 0.5B-class specialists as lower organs. Each organ has one narrow job: produce code skeletons, repair structured output, extract claims, identify contradictions, render commands, critique failures, match memory forms, or patch wounded outputs.

An organ is not decorative. If it is never called, or never improves score, latency, or reliability, it is tissue for the next surgery pass. An organ audit found that five out of eight registered organs had never been called. That finding changed the routing architecture.

  • Population: 8 organs defined · 5 in production after BD9 (code_skeleton · triz · wound v2 · json_repair · claim_extractor) · 2 YELLOW (test_writer, cache_matcher) · 1 RED queued BD9.1 (renderer).
  • Class: Sub-billion parameter models. Fast, auditable, locally runnable.
  • Criterion: An uncalled organ is a dead organ. Liveness is a hard requirement.
Defined: 8 organs (in spine) · In production: 5 (post BD9 sweep) · Queued BD9.1: 1 organ (renderer)

03 · THE BLACK DOG · REINFORCEMENT LOOP

A loop that feeds and starves pathways.

Every route in the system receives food or poison. A successful organ chain becomes easier to choose next time — its conductance increases. A failed route loses conductance. Repeated failures are harvested into surgical training data.

The Black Dog is the system's pain memory. It is how the organism learns what not to do again. When the BD6 surgery pass applied this to the 0.5B code organ (harvesting failed traces, training a QLoRA adapter, merging, repacking, and rerunning), the MBPP score roughly doubled, from 6 to 13 out of 100, and the HumanEval score tripled, from 2 to 6 out of 164. The mechanism works.

  • Signal: Food on success, poison on failure. Conductance updated per route.
  • Memory: Conductance per route, not a global loss function. Surgical and specific.
  • Output: Surgery datasets. Failed traces become QLoRA training material.
Signal: food / poison
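
A per-route conductance store can be sketched as an exponential moving average over food (1.0) and poison (0.0). This is a minimal illustration under stated assumptions: the class name `ConductanceStore`, the smoothing factor `alpha`, and the starting value are hypothetical, and the production update rule may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

RouteKey = Tuple[str, Tuple[str, ...]]   # (pattern, organ_chain)


class ConductanceStore:
    """EMA of success per (pattern, organ_chain); hypothetical sketch."""

    def __init__(self, alpha: float = 0.2, initial: float = 0.5) -> None:
        self.alpha = alpha
        self._g: Dict[RouteKey, float] = defaultdict(lambda: initial)

    def update(self, pattern: str, chain: Sequence[str], success: bool) -> float:
        """Feed (success) or poison (failure) one route and return its new conductance."""
        key = (pattern, tuple(chain))
        signal = 1.0 if success else 0.0
        self._g[key] = (1.0 - self.alpha) * self._g[key] + self.alpha * signal
        return self._g[key]

    def best_chain(self, pattern: str, candidates: Iterable[Sequence[str]]) -> Tuple[str, ...]:
        """Pick the candidate chain with the highest conductance for this pattern."""
        return max((tuple(c) for c in candidates), key=lambda c: self._g[(pattern, c)])


store = ConductanceStore(alpha=0.5, initial=0.5)
fed = store.update("code", ["code_skeleton"], success=True)    # 0.75
starved = store.update("code", ["7B"], success=False)          # 0.25
choice = store.best_chain("code", [["code_skeleton"], ["7B"]])
```

Because the state is keyed per route, one failing chain loses priority without touching any other route, which is the "surgical and specific" property described above.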

04 · MEMORY SPINE · PERSISTENT RECALL

Persistent recall. Not chat history.

Frankenstellm does not rely on conversation context as memory. It uses a structured archive of indexed records, reports, decisions, and execution traces. A claim can be tied back to where it came from. A previous failure can be retrieved and used as training material.

Memory is not decoration. It is anatomical continuity. Without it, the organism forgets its wounds. The same failure repeats. The same wrong route gets chosen again. The spine is what gives the system a history it can act on.

  • Recall: Volume / line / record precise, not approximate semantic search.
  • Contents: Indexed records, execution traces, decisions, failure logs, repair histories.
  • Use: Source grounding, failure harvest, route memory, audit trail.
Traces: indexed · Recall: line-precise
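
Line-precise recall can be illustrated with a small in-memory index. This sketch assumes an address of the form `<16 hex chars>/<line number>`, where the 16 characters are a truncated SHA-256 of the volume's text. What the real spine hashes and how it stores volumes are not stated on this page, so `Spine`, `index`, and `recall` are hypothetical names.

```python
import hashlib
from typing import Dict, Tuple


class Spine:
    """Toy line-addressable archive (hypothetical sketch)."""

    def __init__(self) -> None:
        self._lines: Dict[Tuple[str, int], str] = {}

    def index(self, text: str) -> str:
        """Store every line of a volume and return the volume id."""
        volume = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        for number, line in enumerate(text.splitlines(), start=1):
            self._lines[(volume, number)] = line
        return volume

    def recall(self, address: str) -> str:
        """Exact lookup by 'volume/line'; raises KeyError instead of guessing."""
        volume, _, number = address.partition("/")
        return self._lines[(volume, int(number))]


spine = Spine()
vol = spine.index("route=code\nverifier=fail\nrepair=wound v2")
line_two = spine.recall(f"{vol}/2")
```

A miss raises an error rather than returning the nearest match, which is the difference between exact recall and approximate semantic search.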

05 · THE VERIFIER · IMMUNE LAYER

A hard gate against hallucination.

The verifier is the system's immune layer. Code is compiled. JSON is parsed. Terminal tasks run in capsules. Claims require source pointers. If an answer cannot pass the relevant check, it is not treated as complete.

A model can guess. The verifier decides whether the guess survives. The default stance is suspicious. An answer that cannot be verified is not an answer — it is raw material for the wound system.

  • Default: Suspicious. No output is accepted without passing its check.
  • Checks: Compile, parse, execute, hash, source-pointer. Task-specific.
  • Output: Pass / fail / evidence. Failures route directly to the wound system.
Default: suspicious
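
Two of the checks named above, parse and compile, can be sketched with the Python standard library. This is an illustration only: the `Verdict` type, the `CHECKS` table, and the route names are assumptions, and the compile check here covers Python source only.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    passed: bool
    evidence: str


def verify_json(text: str) -> Verdict:
    """Parse check: the output must be valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return Verdict(False, f"json parse error: {err.msg} at pos {err.pos}")
    return Verdict(True, "json parsed")


def verify_python(source: str) -> Verdict:
    """Compile check: the output must be syntactically valid Python."""
    try:
        compile(source, "<organ-output>", "exec")
    except SyntaxError as err:
        return Verdict(False, f"syntax error on line {err.lineno}: {err.msg}")
    return Verdict(True, "python compiled")


# Hypothetical route -> check table.
CHECKS: Dict[str, Callable[[str], Verdict]] = {
    "json": verify_json,
    "code": verify_python,
}


def verify(route: str, output: str) -> Verdict:
    """Suspicious by default: a route with no registered check fails."""
    check = CHECKS.get(route)
    if check is None:
        return Verdict(False, f"no check registered for route {route!r}")
    return check(output)
```

Returning the evidence string alongside the pass/fail bit is what lets a failure be routed onward as a wound rather than simply dropped.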

06 · THE WOUND SYSTEM · FAILURE → SURGERY

Failure becomes training material.

When an organ fails, the system does not simply discard the output. It records the task, the organ response, the verifier error, the stderr, the expected behavior, and the eventual repair. These wounds become the dataset for the next surgery pass.

This is the central difference between a chatbot and an organism: failure is metabolized. The BD6 surgery pass proved this in practice — a 0.5B code organ improved from 6/100 to 13/100 on MBPP after one wound-harvest-train-repack cycle. The numbers are still small. The mechanism is real.

  • Input: Failed traces. Task, response, error, stderr, expected output.
  • Process: Poison harvest → QLoRA → merge → repack. One full surgical loop.
  • Output: A stronger organ. Benchmarked before and after. No exceptions.
BD6 result: MBPP 6 → 13 / 100 · HumanEval 2 → 6 / 164
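
A wound record with the fields listed above might look like the following. The field names, the JSONL serialisation, and the prompt/completion pair format are assumptions for illustration; the actual trace schema and training format are not given on this page.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class Wound:
    """One failed trace (hypothetical schema)."""
    task: str
    organ: str
    response: str
    verifier_error: str
    stderr: str
    expected: str
    repair: Optional[str] = None

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to append to a poison dataset."""
        return json.dumps(asdict(self), ensure_ascii=False)

    def to_training_pair(self) -> Optional[Dict[str, str]]:
        """Join the task with a known-good target; skip wounds that have none."""
        target = self.repair or self.expected
        if not target:
            return None
        return {"prompt": self.task, "completion": target}


wound = Wound(
    task="Write add(a, b).",
    organ="phys05_code_skeleton",
    response="def add(a, b) return a + b",
    verifier_error="syntax error on line 1",
    stderr="",
    expected="def add(a, b):\n    return a + b\n",
)
pair = wound.to_training_pair()
```

Keeping the verifier error and stderr next to the response preserves why the output failed, not only that it failed.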

07 · THE BODY · NATIVE RUNTIME

A compiled runtime that ships the organism.

Frankenstellm is not intended to live as Python glue. The live path belongs in a compiled native runtime: routing, model loading, verification hooks, DAG traces, capsules, and memory all sit under one local body. Python may exist in the operating theatre as a temporary surgical tool. It does not become the organism.

Compile what you ship. Research tooling can use Python. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. The body is what makes the organism sovereign — deployable on consumer hardware, without cloud dependency, without telemetry.

  • Runtime: C++ / CUDA / native packs. Compiled, not interpreted.
  • Target: Local consumer hardware. Sovereign deployment, no cloud dependency.
  • Doctrine: Compile what you ship. Python is for the operating theatre only.
Runtime: C++ / CUDA · Target: consumer GPU · Dependency: none

Selected Work

What happened when we ran the numbers.

Case 01 · BD6 Surgery

The first organ surgery loop.

We measured the raw 0.5B code organ against public coding tasks and found the truth: it was fast, but weak. On MBPP it solved 6 out of 100 tasks. On HumanEval it solved 2 out of 164. That failure was not hidden. It became surgical material.

The Black-Dog pipeline harvested the failed traces, joined them with reference solutions, trained a QLoRA adapter, merged it back into the organ, repacked into the native format, and reran the same benchmarks — with no 7B fallback allowed.

Organ: phys05_code_skeleton
MBPP before: 6 / 100
MBPP after BD6: 13 / 100
HumanEval before: 2 / 164
HumanEval after BD6: 6 / 164
7B fallback calls: 0
Case 02 · A/B/C Truth Table

Organ truth table: three routing modes.

Before surgery, we forced the system into three modes: A (7B only), B (0.5B organ only), C (organ-first with 7B fallback). The result was uncomfortable and useful. On MBPP: A scored 60, B scored 6, C scored 60. On HumanEval: A scored 81, B scored 2, C scored 81.

The conclusion was clear: the runtime was wired correctly, but the small organs were not yet strong enough to improve on the top brain. That finding became the reason for BD6.

MBPP, 7B only (A): 60 / 100
MBPP, organ only (B): 6 / 100
MBPP, organ + fallback (C): 60 / 100
HumanEval, 7B only (A): 81 / 164
HumanEval, organ only (B): 2 / 164
Conclusion: Organs needed surgery
Case 03 · Organ Audit

The organ liveness audit.

A traffic audit showed that earlier runs were too often 7B-monolithic. In the last 500 DAG entries before rewiring, every call went through the 7B chat path. Five of eight registered 0.5B organs had never been called at all. The wound organ did not exist yet.

That audit changed the architecture. Frankenstellm now treats organ liveness as a hard requirement: if an organ is not called, logged, scored, and improved, it is not part of the body.

DAG entries audited: 500
7B-only routes: 500 / 500
Dead organs found: 5 / 8
Wound organ status: Did not exist
Correction: Organ-first rewiring + surgery
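
A liveness audit of this kind reduces to counting organ appearances across DAG entries. The sketch below assumes each entry is a dict with an `organ_chain` list; that shape, the function name, and the sample organ names are illustrative assumptions rather than the real trace format.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def audit_liveness(
    registered: Iterable[str], dag_entries: Iterable[Dict]
) -> Tuple[Counter, List[str]]:
    """Count the entries each organ appears in and list organs never called."""
    calls: Counter = Counter()
    for entry in dag_entries:
        # Count an organ once per entry, even if a chain names it twice.
        calls.update(set(entry.get("organ_chain", [])))
    dead = sorted(name for name in registered if calls[name] == 0)
    return calls, dead


registered = ["code_skeleton", "json_repair", "renderer"]
entries = [
    {"organ_chain": ["7B"]},
    {"organ_chain": ["code_skeleton", "7B"]},
    {"organ_chain": ["7B"]},
]
calls, dead = audit_liveness(registered, entries)
```

In this toy sample two of the three registered organs are dead, and every entry includes the 7B path, the same pattern the audit reported at larger scale.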
Case 04 · Proof-Carrying Execution

Answers that carry their own evidence.

Frankenstellm does not merely write shell commands or code — it can execute them inside controlled capsules, capture stdout and stderr, verify artifacts, hash outputs, and preserve a replay recipe. This turns answers into evidence.

When a route passes, the system can show exactly what ran. When it fails, the wound becomes training material. An output without a verifier trace is only text. Frankenstellm is built to return not just an answer, but the evidence that produced it.

Execution: Capsule-based, sandboxed
Evidence: stdout / stderr / hashes
Replay: Preserved per task
Failure use: Wound → surgery dataset
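
The evidence-capture part of this can be sketched with `subprocess` and `hashlib`. Note the assumptions: a plain subprocess is not a sandbox, so this shows only the capture, hash, and replay-recipe steps, and the `Evidence` record and `run_with_evidence` helper are hypothetical names.

```python
import hashlib
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Evidence:
    argv: List[str]        # replay recipe: the exact command that ran
    exit_code: int
    stdout: str
    stderr: str
    stdout_sha256: str


def run_with_evidence(argv: Sequence[str], timeout_s: float = 10.0) -> Evidence:
    """Run a command, capture its output, and hash stdout for later comparison."""
    proc = subprocess.run(
        list(argv), capture_output=True, text=True, timeout=timeout_s
    )
    digest = hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest()
    return Evidence(list(argv), proc.returncode, proc.stdout, proc.stderr, digest)


# Use the current interpreter so the example does not depend on a shell.
evidence = run_with_evidence([sys.executable, "-c", "print('ok')"])
replayed = run_with_evidence(evidence.argv)
```

Replaying the stored `argv` and comparing `stdout_sha256` is one simple way to check that an answer's evidence is reproducible.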

Production system · Mode C

What the full system actually does on benches.

Mode C is the production path: organ-first, with 7B top-brain fallback if the organ chain fails verification. These are 7B-class numbers — what the system delivers when allowed to use its full anatomy.

Benchmark | n | Pass | Wall | Organs used | 7B fallback
MBPP | 100 | 60 / 100 | 5 353 s | phys05_code_skeleton + 7B | 99
HumanEval | 164 | 81 / 164 | 8 629 s | phys05_code_skeleton + 7B | 164
ARIZ TRIZ contradictions | 100 | 88 / 100 | ~5 s/task | phys05_triz_contradiction | 0
Terminal-NanoOS | 30 | 22 / 30 | 25 s/task | shell capsule + 7B |

source: BENCH_CLEANUP_AND_OFFICIAL_RUN.md · BD7_TRIZ_SURGERY_FINAL.md · TERMINAL_NANOOS_30.md

Surgery evidence · Mode B

What the surgery did to the organ alone.

Mode B disables the 7B fallback so we can measure what BD6 actually changed in the 0.5B specialist. This is not the system's production performance. It is the surgical delta on the organ — proof the surgery loop has a real effect, not a marketing number.

Benchmark | Before BD6 | After BD6 | Δ | Anchor | Organ leaks | Fallback
MBPP | 6 / 100 | 13 / 100 | +7 (+117 %) | 19/19 | 0 | 0
HumanEval | 2 / 164 | 6 / 164 | +4 (+200 %) | 19/19 | 0 | 0
LiveCodeBench easy | 0 / 50 | 0 / 50 | 0 (dispatcher routes LCB to ARIZ; out of scope for this organ) | | 0 | 0

source: BD6_POST_SURGERY_DELTA.md · MBPP_HE_3MODE_V1.md
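
The Δ column is plain arithmetic on solved-task counts. A small helper, with the hypothetical name `surgical_delta`, makes the calculation explicit; it returns no percentage when the baseline is zero, since a relative change is undefined there.

```python
from typing import Optional, Tuple


def surgical_delta(before: int, after: int) -> Tuple[int, Optional[int]]:
    """Absolute change in solved tasks and relative change in whole percent."""
    absolute = after - before
    if before == 0:
        return absolute, None          # relative change undefined at zero baseline
    return absolute, round(100 * absolute / before)


mbpp = surgical_delta(6, 13)           # (7, 117)
humaneval = surgical_delta(2, 6)       # (4, 200)
livecodebench = surgical_delta(0, 0)   # (0, None)
```

The percentages are large because the baselines are small: +117 % on MBPP is seven additional tasks out of 100.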

The surgery loop:

bench failures  →  poison dataset  →  QLoRA on 0.5B  →  merge  →  repack
                →  anchor gate (19/19 required)  →  flip pack  →  re-bench
                →  KEEP if all four gates pass, otherwise REVERT.

Eight surgery passes were reverted before BD6 pass-1 was kept. Reverts are recorded as negative results, not hidden.

source: BD6_2_OVERTRAIN_DELTA.md … BD6_8D_RANK_FINAL_FREEZE.md
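
The KEEP / REVERT decision can be sketched as an all-gates-must-pass check. Only the anchor gate (19/19) is named on this page; the other gate names below are placeholders invented for the example, and `decide` and `anchor_gate` are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def anchor_gate(passed: int, total: int = 19) -> bool:
    """The anchor gate requires every anchor task to pass."""
    return passed == total


def decide(gates: Dict[str, bool]) -> Tuple[str, List[str]]:
    """KEEP only if every gate passed; otherwise REVERT and name the failures."""
    failed = [name for name, ok in gates.items() if not ok]
    return ("KEEP" if not failed else "REVERT"), failed


# Gate names other than "anchor" are placeholders, not the real four gates.
kept = decide({"anchor": anchor_gate(19), "gate_b": True, "gate_c": True, "gate_d": True})
reverted = decide({"anchor": anchor_gate(18), "gate_b": True, "gate_c": True, "gate_d": True})
```

Returning the list of failed gates with the verdict lets a revert be logged as a negative result with its reason attached.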

Anatomy

The organism, diagrammed.

A real diagram of the live runtime — top-brain Q4 7B, five production organ packs after the BD9 sweep (code_skeleton · triz · wound v2 · json_repair · claim_extractor), the BD8 critic+wound queued for retraining, the verifier as the immune layer, the Black-Dog router that updates conductance per route, and the line-addressable memory spine underneath.

TOP BRAIN · Layer 4
  Physarium-7B Q4 · 5.55 GB · 83.58 tok/s · synthesis only
  called only when organs fail verification
  ↓ fallback path ↓

PRODUCTION · BD6 · phys05_code_skeleton
  .planck · 988 MB BF16
  MBPP B 13 / 100 · HE B 6 / 164 · LCB 0 / 50 · anchor 19 / 19

PRODUCTION · BD7 · phys05_triz_contradiction
  .planck · 988 MB BF16
  ARIZ strict 88 / 100 · 6 fields ≥88 each · fallback 0

BD9 SWEEP · newly surgered · BD9.1 queued
  3 GREEN · 2 YELLOW · 1 RED (BD9 sweep, 2026-05-05)
  phys05_json_repair ✓ · phys05_claim_extr ✓ · phys05_wound v2 ✓
  phys05_test_writer ⚠ · phys05_cache_matcher ⚠ · phys05_renderer ×
  json_repair 10/10 catalog · renderer queued BD9.1 (50+ rows or r=16)

ROUTER · Black-Dog
  conductance store · EMA per (pattern, organ_chain) · selects max-conductance chain

VERIFIER · immune layer
  hard checks: JSON · code compile · exit code · source-pointer (memory-anchored)
  → DAG entry write · ← food / poison signal · verifier-fail → 7B fallback

MEMORY SPINE · line-addressable
  scrolls · hologram · DAG · volumes
  files 305 · lines 58 996 · address sha256[:16] / line · cache hit 860 ms → 1 ms
  ↑ persistent state · ↑ provenance

How a request flows

Seven steps, one DAG entry.

  1. Dispatch. The dispatcher classifies input (regex + heuristics) into a route: json / code / ariz / claim / cache / wound / renderer / chat.
  2. Router consults conductance. For the (route, organ_chain) pair, the Black-Dog router selects the chain with the best food-vs-poison ratio.
  3. Lower organ runs first. A 0.5B specialised pack handles the request, typically in 3–5 s on RTX 3060 Ti.
  4. Verifier checks structure. JSON schema, code compile, test runs, source presence: task-specific gates.
  5. Critic + wound repair. If the verifier fails, wound v2 attempts in-chat repair (BD9 retained for chat path); for ARIZ JSON specifically the rescue rate is still 0 across BD8 V1–V5, and BD8 retraining is queued.
  6. 7B top-brain synthesis. Only if step 5 also fails, and then exactly one 7B call.
  7. DAG entry written. Final answer plus organ_chain, food, poison, conductance delta, verifier reason, fallback used. Every request leaves an audit trail.
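
The control flow of these steps can be sketched with stand-in callables. Everything here is illustrative: the two regexes in `dispatch` are toy heuristics, the entry omits food, poison, and conductance fields for brevity, and sending a route with no organ straight to the 7B path is an assumption of this sketch.

```python
import re
from typing import Callable, Dict, List

Organ = Callable[[str], str]


def dispatch(prompt: str) -> str:
    """Toy classifier; the real regexes and heuristics are not given here."""
    if re.search(r"^\s*[\[{]", prompt):
        return "json"
    if re.search(r"\b(def|function|implement)\b", prompt, re.IGNORECASE):
        return "code"
    return "chat"


def handle(
    prompt: str,
    organs: Dict[str, Organ],
    verify: Callable[[str, str], bool],
    repair: Callable[[str, str], str],
    brain: Organ,
) -> dict:
    """Organ first, then wound repair, then exactly one 7B call; returns a DAG-style entry."""
    route = dispatch(prompt)
    chain: List[str] = []
    answer, ok = "", False

    organ = organs.get(route)
    if organ is not None:
        chain.append(f"{route}_organ")
        answer = organ(prompt)
        ok = verify(route, answer)
        if not ok:                      # wound repair attempt
            chain.append("wound")
            answer = repair(prompt, answer)
            ok = verify(route, answer)

    fallback_used = not ok
    if fallback_used:                   # last resort: one top-brain call
        chain.append("7B")
        answer = brain(prompt)
        ok = verify(route, answer)

    return {
        "route": route,
        "organ_chain": chain,
        "answer": answer,
        "verified": ok,
        "fallback_used": fallback_used,
    }
```

The returned dict plays the role of the DAG entry: it records which chain ran and whether the fallback was needed, so a router or an audit can act on it later.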

Sovereign comparison

Four axes versus same-weights PARROT.

Honest axis-by-axis from SOVEREIGN_WIN_REPORT_V2.md. Axis A is parity. Axes B / C / D are architectural advantages a frontier API cannot reproduce, because they require a local runtime with persistent state and a verifier loop.

Axis | What | Result | What an API cannot do
A | HumanEval pass@1 vs same-weights PARROT | 14/20 = 70 % · parity |
B | MBPP repeat-learning ×3 rounds | round 2: 13/20 vs PARROT 12/20 · ✅ MRL | API has no round-2-with-evidence loop
C | Hologram cache · identical prompt | 860 ms → 1 ms · 860× speedup | API charges full price every call
D | Evidence per answer | DAG node + organ chain + gate result | API returns a string

Production state

Working — and not yet working.

Working

  • C++ runtime · native CUDA forward · single binary.
  • Production code-skeleton organ · MBPP 13/100, HE 6/164, LCB 0/50 · anchor 19/19 · frozen.
  • Production TRIZ organ · ARIZ 88/100 strict · fallback 0.
  • BD9 phys05_json_repair · 10 / 10 GREEN on production failure catalog.
  • BD9 phys05_claim_extractor · GREEN · clean structured-JSON output.
  • Production grew from 2 to 5 surgered organs in one BD9 session.
  • Q4 Physarium-7B · 83.58 tok/s production (llama.cpp) · 18.27 tok/s native default · 28.99 tok/s with DP4A flag · 5.55 GB VRAM · RTX 3060 Ti.
  • Memory spine indexed · 305 files · 58 996 lines.
  • Acceptance suite 18/18 · identity probe 14/14.
  • Hologram cache · 860 ms → 1 ms on identical input (860×).
  • Repeat-learning round 2 · MBPP-20 · 13/20 vs PARROT 12/20.
  • Terminal-NanoOS-30 · MONSTER 22/30 vs PARROT 20/30 (+2).

Not yet at production gate

  • BD8 critic + wound rescue rate · 0/n on ARIZ JSON · wound v2 retained for in-chat path · BD8 retraining queued.
  • BD9 phys05_test_writer YELLOW · pytest shape correct, semantics drift.
  • BD9 phys05_cache_matcher YELLOW · correct integer + post-answer drift.
  • BD9 phys05_renderer RED · output corrupted on free-form bash · queued BD9.1 (50+ rows or r=16).
  • Black-Dog conductance arbitration in Python harness only · C++ port queued.
  • Critic + wound for non-terminal routes.
  • Memory exact-lookup CLI · not built.
  • Memory TF-IDF semantic ranker · not built.
  • GPQA Diamond runner · gated on dataset auth.
  • SWE-bench Lite runner · gated on Phase-12 NanoOS shell.
  • BFCL 3-mode runner · partial harness.
  • Routing-field heuristic (paused).

Acceptance Integrity Ladder

Nine runs, no regression.

The 18-task curated acceptance suite is the integrity gate every change has to pass. Identity probe, organ-leaks gate, structural verifier, end-to-end JSON / code / claim / memory routes. No surgery merges if this regresses.

Run | Result | Identity | Leaks | Wall | Note
v14 · llama.cpp backend | 18/18 | 14/14 | 0 | 2.99 s | production ceiling · CURRENT_TRUTH_LEDGER §1
v15 · DP4A native | 17/18 | | 0 | | opt-in flag · close to 18/18 with G2.b/G3/G4 fixes
v16 · after Gap C close | 18/18 | | 0 | | GAP_C_KILL_FINAL.md
v17 · llamacpp after Gap C | 18/18 | | 0 | |
v18 · after G3 (Python compile probe) | 18/18 | | 0 | | verifier hardening
v19 · after holographic form replay | 18/18 | | 0 | | HOLOGRAPHIC_FORM_REPLAY_V1.md
v20 · after native CR | 18/18 | | 0 | | code-repair loop in native runtime
v21 · anchored preamble | 18/18 | | 0 | | identity anchor in system message
v22 · post anchor | 18/18 | | 0 | | stable post-9F LoRA anchor

v14 confirmed in CURRENT_TRUTH_LEDGER §1. v15 → v22 trajectory documented in HISTORY_TREE.md §5/§6; per-run JSONs preserved in master logs. Headline: every shipped change passed acceptance.

Reports

Every claim has a file behind it.

Roadmap
FRANKENLLM_ROADMAP_STATUS_V1.md — master 8-track status
Truth Ledger
CURRENT_TRUTH_LEDGER.md — single source of truth
BD7
BD7_TRIZ_SURGERY_FINAL.md — 0 → 88/100 strict 6-field JSON
BD9
BD9_JSON_REPAIR_FINAL.md — phys05_json_repair 10/10 GREEN on production failure catalog
BD9
BD9_FOUR_ORGANS_FINAL.md — four-organ sweep, production grew 2 → 5 organs
Bench
MBPP_HE_3MODE_V1.md / .json — MBPP 13/100 · HE 6/164 (Mode-B, no fallback)
Bench
SOVEREIGN_WIN_REPORT_V2.md — 4-axis honest comparison vs PARROT
Bench
SOVEREIGN_COGNITION_GAUNTLET_V1.md — RED 59/60 vs PARROT 60/60 · honest
Bench
TERMINAL_NANOOS_30.md — MONSTER 22/30 vs PARROT 20/30
Bench
EXACT_REPLAY_CACHE_V1.md — 860 ms → 1 ms cache hit
Bench
MBPP_REPEAT_LEARNING_V1.md — round 2 · MRL ≥ PARROT
Runtime
RUNTIME_ORGANISM_BENCH_V1.md — BD2 + BD4 first integration
Audit
GIGACHAD_SYSTEM_INTEGRITY_AUDIT.md — 10-layer connectivity check
Audit
CLEAN_ROOM_DOCTRINE.md — externals are autopsy specimens, never spine

All reports live in /home/pc/gigachad_native/reports/. Numbers on this page point back to them.

Principles

Five rules that govern the organism.

01

Organs must earn their place.

A specialist model that is never called is dead. A specialist that does not improve score, latency, or reliability is tissue for the next surgery pass. Organ liveness is a hard requirement, not a preference.

02

The brain is last, not first.

The top model should not handle every task by default. Lower organs attempt narrow work first. The brain synthesizes, judges, or rescues only when organs cannot carry the load. Brain-first routing is the failure mode.

03

Failure is not waste.

Every failed output is a training example waiting to be harvested. The verifier produces the wound. The Black Dog marks it as poison. Surgery turns it into a stronger organ. This cycle is the organism's immune system.

04

No proof, no claim.

An answer without a verifier trace is only text. Frankenstellm is built to return not only an output, but the evidence that produced it. An unverified output is raw material, not a result.

05

Native is the body.

Research tooling can use Python. The live organism cannot depend on it. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. Compile what you ship.