About

We do not build a bigger parrot. We assemble a body.

A single language model is powerful, but fragile. It has no stable organs, no bloodstream, no memory of its own failures, no nervous system that tells it when it has lied. Frankenstellm is our answer to that limitation.

At the top sits a refined reasoning model — the brain. Below it sits a colony of compact sub-billion-parameter specialists: organs for code skeletons, JSON repair, claim extraction, contradiction analysis, rendering, critique, cache matching, and wound repair. Around them sits a native runtime that routes tasks, executes checks, writes traces, and records what happened.

Every output passes through a verifier. Every failure becomes poison. Every success becomes food. The Black-Dog reinforcement loop updates routing conductance so the system does not repeat the same mistake forever.

This is not a prompt chain. It is a stitched runtime organism.

The organism is not finished. Its advantage is not that every organ is already strong. Its advantage is that weak organs can be found, measured, wounded, operated on, repacked, and improved.

Results frontier APIs cannot replicate

Five measured wins, not five claims.

A · Parity

HumanEval pass@1 vs PARROT (same-weights Q4 7B): 70 % / 70 %.

Single-shot, same model class, same hardware. The runtime hits the ceiling that the underlying weights allow.

source: SOVEREIGN_WIN_REPORT_V2.md · axis A

B · Repeat-learning

MBPP round 2 (same 20 problems): MONSTER 13 / 20 vs PARROT 12 / 20.

In round 1, PARROT was ahead. In round 2, MONSTER overtook it because the organism wrote its own scroll between rounds. PARROT cannot do that.

source: MBPP_REPEAT_LEARNING_V1.md

C · Hologram cache

Identical-prompt second call: 860 ms → 1 ms = 860× speedup.

The second call to the exact same prompt is essentially free. An API charges full price every call, every time.

source: EXACT_REPLAY_CACHE_V1.md
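
A minimal sketch of how an exact-replay cache of this kind can work, assuming the key is a hash over a pack identifier plus the raw prompt text. The class name `ExactReplayCache`, the `pack_id` parameter, and the `generate` callback are illustrative assumptions, not names from the runtime.

```python
import hashlib
from typing import Callable, Dict


class ExactReplayCache:
    """Exact-match cache: same pack and same prompt text return the stored answer."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(pack_id: str, prompt: str) -> str:
        # Hash pack id and prompt together; the NUL byte keeps the two fields apart.
        digest = hashlib.sha256()
        digest.update(pack_id.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(self, pack_id: str, prompt: str,
                       generate: Callable[[str], str]) -> str:
        k = self.key(pack_id, prompt)
        if k in self._store:
            self.hits += 1          # replay: no model call at all
            return self._store[k]
        self.misses += 1
        answer = generate(prompt)   # hypothetical slow path (model inference)
        self._store[k] = answer
        return answer
```

A cache like this is only safe when decoding is deterministic for a given pack and prompt; otherwise a replayed answer could differ from what a fresh call would produce.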

D · Terminal capsules

Terminal-NanoOS-30 tasks: MONSTER 22 / 30 vs PARROT 20 / 30 (+2).

Capsule-isolated shell tasks: the runtime carries verifier and retry context that an API call doesn't get to keep.

source: TERMINAL_NANOOS_30.md

E · Acceptance integrity

Internal acceptance bench: 18 / 18 · identity 14 / 14 · leaks 0.

Architecture audit 10/10. Reproducible decode, deterministic per pack+prompt, no organ leaks across routes.

source: gigachad_acceptance_run_v14_llamacpp.json

These are the architectural wins: parity on the canonical bench, three properties a hosted API cannot reproduce by design, and a clean integrity gate. The organism's value is not its raw single-shot score; it is what it accumulates between calls.

The Stack

Seven systems. One body.

01 · THE BRAIN · TOP-LEVEL REASONING

A refined top-level reasoning model.

The brain is the high-capacity model that handles synthesis, judgment, and difficult reasoning. It is not expected to do everything. It receives work from the lower organs and is called last, not first, when the smaller specialists cannot solve the task alone.

Brain-first routing is the failure mode. If the 7B model handles every task by default, the organism is just an expensive wrapper. The brain exists to rescue, synthesize, and decide — not to answer every question before an organ gets a chance.

Role: Top-level synthesis, judgment, rescue when organs fail.
Donor line: Qwen / DeepSeek-class open models; auditable weights, deployable locally.
Policy: 7B last, organs first. Calling the brain is the fallback, not the default.

Model class: 7B · Calls: last resort · Policy: organ-first


02 · THE ORGANS · SPECIALIST COLONY

A colony of small specialist models.

Frankenstellm uses compact 0.5B-class specialists as lower organs. Each organ has one narrow job: produce code skeletons, repair structured output, extract claims, identify contradictions, render commands, critique failures, match memory forms, or patch wounded outputs.

An organ is not decorative. If it is never called, or never improves score, latency, or reliability, it is tissue for the next surgery pass. An organ audit found that five out of eight registered organs had never been called. That finding changed the routing architecture.

Population: 8 organs defined · 5 in production after BD9 (code_skeleton · triz · wound v2 · json_repair · claim_extractor) · 2 YELLOW (test_writer, cache_matcher) · 1 RED, queued for BD9.1 (renderer).
Class: Sub-billion-parameter models; fast, auditable, locally runnable.
Criterion: An uncalled organ is a dead organ. Liveness is a hard requirement.

Defined: 8 organs (in spine) · In production: 5 (post BD9 sweep) · Queued for BD9.1: 1 organ (renderer)


03 · THE BLACK DOG · REINFORCEMENT LOOP

A loop that feeds and starves pathways.

Every route in the system receives food or poison. A successful organ chain becomes easier to choose next time — its conductance increases. A failed route loses conductance. Repeated failures are harvested into surgical training data.

The Black Dog is the system's pain memory. It is how the organism learns what not to do again. When the BD6 surgery pass applied this to the 0.5B code organ (harvesting failed traces, training a QLoRA adapter, merging, repacking, and rerunning), the organ's MBPP score roughly doubled, from 6 / 100 to 13 / 100, and its HumanEval score tripled, from 2 / 164 to 6 / 164. The mechanism works.

Signal: Food on success, poison on failure; conductance is updated per route.
Memory: Conductance per route, not a global loss function. Surgical and specific.
Output: Surgery datasets. Failed traces become QLoRA training material.
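
The feed-and-starve loop can be sketched as a per-route table of conductance values. This is an illustration under stated assumptions, not the runtime's arbitration code: the multiplicative update rule, the `food_gain` and `poison_loss` constants, the floor, and the class name `BlackDogRouter` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

RouteKey = Tuple[str, Tuple[str, ...]]  # (route, organ_chain)


@dataclass
class BlackDogRouter:
    food_gain: float = 0.10     # assumed: success raises conductance by 10 %
    poison_loss: float = 0.20   # assumed: failure lowers conductance by 20 %
    floor: float = 0.01         # assumed: a starved route never reaches zero
    conductance: Dict[RouteKey, float] = field(default_factory=dict)
    poison_log: List[dict] = field(default_factory=list)

    def feedback(self, route: str, chain: Sequence[str], passed: bool,
                 trace: Optional[dict] = None) -> float:
        """Feed a chain on success, poison it on failure; return its new conductance."""
        key = (route, tuple(chain))
        g = self.conductance.get(key, 1.0)
        if passed:
            g *= 1.0 + self.food_gain
        else:
            g = max(self.floor, g * (1.0 - self.poison_loss))
            # Failed traces are kept so they can be harvested for surgery later.
            self.poison_log.append({"route": route, "chain": list(chain), "trace": trace})
        self.conductance[key] = g
        return g

    def choose(self, route: str, candidates: Sequence[Sequence[str]]) -> Sequence[str]:
        """Pick the candidate chain with the highest conductance for this route."""
        return max(candidates,
                   key=lambda c: self.conductance.get((route, tuple(c)), 1.0))
```

Keeping the state keyed by (route, organ_chain) is what makes the memory surgical: one bad chain is starved without penalizing the same organ on other routes.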


04 · MEMORY SPINE · PERSISTENT RECALL

Persistent recall. Not chat history.

Frankenstellm does not rely on conversation context as memory. It uses a structured archive of indexed records, reports, decisions, and execution traces. A claim can be tied back to where it came from. A previous failure can be retrieved and used as training material.

Memory is not decoration. It is anatomical continuity. Without it, the organism forgets its wounds. The same failure repeats. The same wrong route gets chosen again. The spine is what gives the system a history it can act on.

Recall: Volume / line / record precise, not approximate semantic search.
Contents: Indexed records, execution traces, decisions, failure logs, repair histories.
Use: Source grounding, failure harvest, route memory, audit trail.
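
Line-precise recall can be illustrated with a small in-memory index. This is a sketch only: the `MemorySpine` and `SourcePointer` names and the substring lookup are assumptions made for the example, and it says nothing about how the real spine stores its records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SourcePointer:
    volume: str   # file or report name
    line: int     # 1-based line number inside that volume


class MemorySpine:
    """Toy line-addressable archive: every line can be cited and retrieved exactly."""

    def __init__(self) -> None:
        self._volumes: Dict[str, List[str]] = {}

    def index(self, volume: str, text: str) -> int:
        """Store a volume line by line and return the number of lines indexed."""
        lines = text.splitlines()
        self._volumes[volume] = lines
        return len(lines)

    def recall(self, ptr: SourcePointer) -> Optional[str]:
        """Return the exact line a pointer refers to, or None if it does not exist."""
        lines = self._volumes.get(ptr.volume)
        if lines is None or not (1 <= ptr.line <= len(lines)):
            return None
        return lines[ptr.line - 1]

    def find_exact(self, needle: str) -> List[SourcePointer]:
        """Exact substring lookup; returns pointers, not paraphrases."""
        return [SourcePointer(volume, number)
                for volume, lines in self._volumes.items()
                for number, line in enumerate(lines, start=1)
                if needle in line]
```

A claim grounded this way carries a pointer that either resolves to a specific line or fails, which is the property that distinguishes it from approximate semantic search.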


05 · THE VERIFIER · IMMUNE LAYER

A hard gate against hallucination.

The verifier is the system's immune layer. Code is compiled. JSON is parsed. Terminal tasks run in capsules. Claims require source pointers. If an answer cannot pass the relevant check, it is not treated as complete.

A model can guess. The verifier decides whether the guess survives. The default stance is suspicious. An answer that cannot be verified is not an answer — it is raw material for the wound system.

Default: Suspicious. No output is accepted without passing its check.
Checks: Compile, parse, execute, hash, source-pointer. Task-specific.
Output: Pass / fail / evidence. Failures route directly to the wound system.
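
Two of the checks named above, compile and parse, can be sketched with the Python standard library alone. The `Verdict` type and the function names are hypothetical; the point is only that each check returns pass or fail together with evidence, instead of a bare boolean.

```python
import json
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Verdict:
    passed: bool
    check: str      # which gate produced this verdict
    evidence: str   # human-readable reason, kept for the audit trail


def verify_python(source: str) -> Verdict:
    """Compile-only gate: catches syntax errors, does not execute the code."""
    try:
        compile(source, "<organ_output>", "exec")
    except (SyntaxError, ValueError) as exc:
        return Verdict(False, "compile", f"{type(exc).__name__}: {exc}")
    return Verdict(True, "compile", "compiled without syntax errors")


def verify_json(text: str, required_keys: Sequence[str] = ()) -> Verdict:
    """Parse gate: the text must be a JSON object containing every required key."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as exc:
        return Verdict(False, "parse", f"JSONDecodeError at char {exc.pos}: {exc.msg}")
    if not isinstance(value, dict):
        return Verdict(False, "parse", f"expected a JSON object, got {type(value).__name__}")
    missing = [k for k in required_keys if k not in value]
    if missing:
        return Verdict(False, "parse", f"missing keys: {missing}")
    return Verdict(True, "parse", "parsed; all required keys present")
```

A failing `Verdict` is exactly the kind of record a wound system would want: it names the gate and preserves the error text.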


06 · THE WOUND SYSTEM · FAILURE → SURGERY

Failure becomes training material.

When an organ fails, the system does not simply discard the output. It records the task, the organ response, the verifier error, the stderr, the expected behavior, and the eventual repair. These wounds become the dataset for the next surgery pass.

This is the central difference between a chatbot and an organism: failure is metabolized. The BD6 surgery pass proved this in practice — a 0.5B code organ improved from 6/100 to 13/100 on MBPP after one wound-harvest-train-repack cycle. The numbers are still small. The mechanism is real.

Input: Failed traces (task, response, error, stderr, expected output).
Process: Poison harvest → QLoRA → merge → repack. One full surgical loop.
Output: A stronger organ. Benchmarked before and after. No exceptions.
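
A wound record with the fields listed above might look like the following. The `Wound` dataclass, the JSONL row layout, and the rule that an eventual repair takes precedence over the expected output are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Wound:
    organ: str
    task: str
    response: str         # what the organ actually produced
    verifier_error: str
    stderr: str
    expected: str         # expected behavior or reference solution, may be empty
    repair: Optional[str] = None   # eventual repaired output, if one exists


def to_training_rows(wounds: Iterable[Wound]) -> List[str]:
    """Turn wounds into JSONL rows; wounds with no usable target are skipped."""
    rows = []
    for w in wounds:
        target = w.repair if w.repair is not None else w.expected
        if not target:
            continue  # nothing to learn from yet: no repair and no reference
        rows.append(json.dumps(
            {"organ": w.organ, "prompt": w.task, "completion": target},
            sort_keys=True,
        ))
    return rows
```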

BD6 result: MBPP 6 → 13 / 100 · HumanEval 2 → 6 / 164


07 · THE BODY · NATIVE RUNTIME

A compiled runtime that ships the organism.

Frankenstellm is not intended to live as Python glue. The live path belongs in a compiled native runtime: routing, model loading, verification hooks, DAG traces, capsules, and memory all sit under one local body. Python may exist in the operating theatre as a temporary surgical tool. It does not become the organism.

Compile what you ship. Research tooling can use Python. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. The body is what makes the organism sovereign — deployable on consumer hardware, without cloud dependency, without telemetry.

Runtime: C++ / CUDA / native packs; compiled, not interpreted.
Target: Local consumer hardware; sovereign deployment, no cloud dependency.
Doctrine: Compile what you ship. Python is for the operating theatre only.

Runtime: C++ / CUDA · Target: consumer GPU · Dependency: none


Selected Work

What happened when we ran the numbers.

Case 01 · BD6 Surgery

The first organ surgery loop.

We measured the raw 0.5B code organ against public coding tasks and found the truth: it was fast, but weak. On MBPP it solved 6 out of 100 tasks. On HumanEval it solved 2 out of 164. That failure was not hidden. It became surgical material.

The Black-Dog pipeline harvested the failed traces, joined them with reference solutions, trained a QLoRA adapter, merged it back into the organ, repacked into the native format, and reran the same benchmarks — with no 7B fallback allowed.

Organ: phys05_code_skeleton
MBPP before: 6 / 100
MBPP after BD6: 13 / 100
HumanEval before: 2 / 164
HumanEval after BD6: 6 / 164
7B fallback calls: 0

Case 02 · A/B/C Truth Table

Organ truth table: three routing modes.

Before surgery, we forced the system into three modes: A (7B only), B (0.5B organ only), C (organ-first with 7B fallback). The result was uncomfortable and useful. On MBPP: A scored 60, B scored 6, C scored 60. On HumanEval: A scored 81, B scored 2, C scored 81.

The conclusion was clear: the runtime was wired correctly, but the small organs were not yet strong enough to improve on the top brain. That finding became the reason for BD6.

MBPP, 7B only (A): 60 / 100
MBPP, organ only (B): 6 / 100
MBPP, organ + fallback (C): 60 / 100
HumanEval, 7B only (A): 81 / 164
HumanEval, organ only (B): 2 / 164
HumanEval, organ + fallback (C): 81 / 164
Conclusion: Organs needed surgery

Case 03 · Organ Audit

The organ liveness audit.

A traffic audit showed that earlier runs were too often 7B-monolithic. In the last 500 DAG entries before rewiring, every call went through the 7B chat path. Five of eight registered 0.5B organs had never been called at all. The wound organ did not exist yet.

That audit changed the architecture. Frankenstellm now treats organ liveness as a hard requirement: if an organ is not called, logged, scored, and improved, it is not part of the body.

DAG entries audited: 500
7B-only routes: 500 / 500
Dead organs found: 5 / 8
Wound organ status: Did not exist
Correction: Organ-first rewiring + surgery
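
An audit of this kind reduces to counting organ appearances in the DAG log. The sketch below assumes each DAG entry is a dictionary with an `organ_chain` list, a field name mentioned on this page; the function name and the report layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def audit_liveness(registered: List[str], dag_entries: Iterable[Dict]) -> Dict:
    """Count how often each organ appears in the DAG log and list the dead ones."""
    calls: Counter = Counter()
    for entry in dag_entries:
        for organ in entry.get("organ_chain", []):
            calls[organ] += 1
    # An organ that is registered but never appears in any chain is dead.
    dead = [organ for organ in registered if calls[organ] == 0]
    return {"calls": dict(calls), "dead": dead}
```

Run over a window of recent entries, the `dead` list is the liveness gate: anything in it is either rewired into a route or removed from the body.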

Case 04 · Proof-Carrying Execution

Answers that carry their own evidence.

Frankenstellm does not merely write shell commands or code — it can execute them inside controlled capsules, capture stdout and stderr, verify artifacts, hash outputs, and preserve a replay recipe. This turns answers into evidence.

When a route passes, the system can show exactly what ran. When it fails, the wound becomes training material. An output without a verifier trace is only text. Frankenstellm is built to return not just an answer, but the evidence that produced it.

Execution: Capsule-based, sandboxed
Evidence: stdout / stderr / hashes
Replay: Preserved per task
Failure use: Wound → surgery dataset
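
The evidence-capture half of this idea can be sketched with the standard library. Note the loud assumption: a plain subprocess in a temporary directory is not a sandbox, so this shows only how stdout, stderr, a hash, and a replay recipe are collected, not how a capsule isolates execution. The `Evidence` type and `run_in_capsule` name are hypothetical.

```python
import hashlib
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    argv: List[str]        # replay recipe: the exact command that ran
    returncode: int
    stdout: str
    stderr: str
    stdout_sha256: str     # hash of the captured output, for later comparison


def run_in_capsule(argv: List[str], timeout_s: float = 10.0) -> Evidence:
    """Run a command in a throwaway working directory and capture its evidence.

    Raises subprocess.TimeoutExpired if the command exceeds timeout_s.
    """
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(argv, cwd=workdir, capture_output=True,
                              text=True, timeout=timeout_s)
    return Evidence(
        argv=list(argv),
        returncode=proc.returncode,
        stdout=proc.stdout,
        stderr=proc.stderr,
        stdout_sha256=hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest(),
    )


if __name__ == "__main__":
    ev = run_in_capsule([sys.executable, "-c", "print('ok')"])
    print(ev.returncode, ev.stdout_sha256)
```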

Production system · Mode C

What the full system actually does on benches.

Mode C is the production path: organ-first, with 7B top-brain fallback if the organ chain fails verification. These are 7B-class numbers — what the system delivers when allowed to use its full anatomy.

Benchmark	n	Pass	Wall	Organs used	7B fallback
MBPP	100	60 / 100	5 353 s	phys05_code_skeleton + 7B	99
HumanEval	164	81 / 164	8 629 s	phys05_code_skeleton + 7B	164
ARIZ TRIZ contradictions	100	88 / 100	~5 s/task	phys05_triz_contradiction	0
Terminal-NanoOS	30	22 / 30	25 s/task	shell capsule + 7B	—

source: BENCH_CLEANUP_AND_OFFICIAL_RUN.md · BD7_TRIZ_SURGERY_FINAL.md · TERMINAL_NANOOS_30.md

Surgery evidence · Mode B

What the surgery did to the organ alone.

Mode B disables the 7B fallback so we can measure what BD6 actually changed in the 0.5B specialist. This is not the system's production performance. It is the surgical delta on the organ — proof the surgery loop has a real effect, not a marketing number.

Benchmark	Before BD6	After BD6	Δ	Anchor
MBPP	6 / 100	13 / 100	+7 (+117 %)	19/19
HumanEval	2 / 164	6 / 164	+4 (+200 %)	19/19
LiveCodeBench easy	0 / 50	0 / 50	0 (dispatcher routes LCB to ARIZ — out of scope for this organ)	—

source: BD6_POST_SURGERY_DELTA.md · MBPP_HE_3MODE_V1.md
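
The Δ column is plain arithmetic and can be reproduced as follows. The function name is illustrative; a relative gain is left undefined when the starting score is zero, which matches the LiveCodeBench row reporting only an absolute 0.

```python
from typing import Dict, Optional


def surgical_delta(before: int, after: int) -> Dict[str, Optional[int]]:
    """Absolute and relative change in solved tasks between two benchmark runs."""
    absolute = after - before
    # Relative gain is undefined when the organ solved nothing before surgery.
    relative_pct = None if before == 0 else round(100.0 * absolute / before)
    return {"absolute": absolute, "relative_pct": relative_pct}


if __name__ == "__main__":
    print("MBPP", surgical_delta(6, 13))
    print("HumanEval", surgical_delta(2, 6))
    print("LiveCodeBench easy", surgical_delta(0, 0))
```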

The surgery loop:

bench failures  →  poison dataset  →  QLoRA on 0.5B  →  merge  →  repack
                →  anchor gate (19/19 required)  →  flip pack  →  re-bench
                →  KEEP if all four gates pass, otherwise REVERT.

Eight surgery passes were reverted before BD6 pass-1 was kept. Reverts are recorded as negative results, not hidden.

source: BD6_2_OVERTRAIN_DELTA.md … BD6_8D_RANK_FINAL_FREEZE.md
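
The keep-or-revert decision at the end of the loop can be sketched as an all-gates-must-pass rule. The page states that four gates exist and that the anchor gate requires 19/19, but it does not enumerate the other gates, so the gate names used in the example call are hypothetical placeholders.

```python
from typing import Dict


def anchor_gate(passed: int, total: int = 19) -> bool:
    """The anchor gate passes only on a perfect score (19/19 in the text)."""
    return passed == total


def decide(gates: Dict[str, bool]) -> str:
    """KEEP only if every gate passed; an empty gate set is treated as a failure."""
    if not gates:
        return "REVERT"
    return "KEEP" if all(gates.values()) else "REVERT"


if __name__ == "__main__":
    # Gate names other than "anchor" are placeholders, not taken from the reports.
    result = decide({
        "anchor": anchor_gate(19),
        "bench_improved": True,
        "acceptance_suite": True,
        "no_organ_leaks": True,
    })
    print(result)
```

Treating a missing gate result as a failure keeps the default stance suspicious: a surgery pass is kept only on positive evidence from every gate.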

Anatomy

The organism, diagrammed.

A real diagram of the live runtime — top-brain Q4 7B, five production organ packs after the BD9 sweep (code_skeleton · triz · wound v2 · json_repair · claim_extractor), the BD8 critic+wound queued for retraining, the verifier as the immune layer, the Black-Dog router that updates conductance per route, and the line-addressable memory spine underneath.

How a request flows

Seven steps, one DAG entry.

01 Dispatch. The dispatcher classifies input (regex + heuristics) into a route — json / code / ariz / claim / cache / wound / renderer / chat.
02 Router consults conductance. For the (route, organ_chain) pair, the Black-Dog router selects the chain with the best food-vs-poison ratio.
03 Lower organ runs first. A 0.5B specialised pack handles the request — typically 3–5 s on RTX 3060 Ti.
04 Verifier checks structure. JSON schema, code compile, test runs, source presence — task-specific gates.
05 Critic + wound repair. If the verifier fails, wound v2 attempts an in-chat repair (wound v2 was retained after BD9 for the chat path). For ARIZ JSON specifically, the rescue rate is still 0 across BD8 V1–V5; BD8 retraining is queued.
06 7B top-brain synthesis. Only if step 5 also fails — exactly one 7B call.
07 DAG entry written. Final answer plus organ_chain, food, poison, conductance delta, verifier reason, fallback used — every request leaves an audit trail.
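
The seven steps can be condensed into a small control-flow sketch. It is a simplification under stated assumptions: the regex patterns, the `handle` signature, and the shape of the returned entry are hypothetical, the conductance lookup of step 02 is omitted, and organs, checks, repair, and brain are passed in as plain callables.

```python
import re
from typing import Callable, Dict, List, Optional

Handler = Callable[[str], str]
Check = Callable[[str], bool]

# Hypothetical dispatch patterns; the real dispatcher's rules are not shown on this page.
ROUTE_PATTERNS = [
    ("json", re.compile(r"\bjson\b", re.IGNORECASE)),
    ("code", re.compile(r"\b(def|function|implement)\b", re.IGNORECASE)),
]


def dispatch(prompt: str) -> str:
    """Step 01: classify the input into a route, defaulting to chat."""
    for route, pattern in ROUTE_PATTERNS:
        if pattern.search(prompt):
            return route
    return "chat"


def handle(prompt: str, organs: Dict[str, Handler], checks: Dict[str, Check],
           repair: Handler, brain: Handler) -> Dict:
    """Organ first, verifier next, wound repair on failure, brain as last resort."""
    route = dispatch(prompt)
    chain: List[str] = []
    answer: Optional[str] = None
    organ = organs.get(route)
    check = checks.get(route, lambda _text: True)
    if organ is not None:
        chain.append(f"organ:{route}")          # step 03: lower organ runs first
        candidate = organ(prompt)
        if check(candidate):                    # step 04: verifier gate
            answer = candidate
        else:
            chain.append("wound")               # step 05: attempt a repair
            repaired = repair(candidate)
            if check(repaired):
                answer = repaired
    fallback_used = answer is None
    if fallback_used:
        chain.append("brain")                   # step 06: exactly one top-brain call
        answer = brain(prompt)
    # Step 07: the returned dictionary stands in for the DAG entry.
    return {"route": route, "organ_chain": chain,
            "fallback_used": fallback_used, "answer": answer}
```

The ordering is the design choice worth noticing: the brain callable is reached only after both the organ and the repair attempt have failed verification.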

Sovereign comparison

Four axes versus same-weights PARROT.

An honest axis-by-axis comparison from SOVEREIGN_WIN_REPORT_V2.md. Axis A is parity. Axes B / C / D are architectural advantages a frontier API cannot reproduce, because they require a local runtime with persistent state and a verifier loop.

Axis	What	Result	What an API cannot do
A	HumanEval pass@1 vs same-weights PARROT	14/20 = 70 % · parity	—
B	MBPP repeat-learning ×3 rounds	round 2: 13/20 vs PARROT 12/20 · ✅ MRL	API has no round-2-with-evidence loop
C	Hologram cache · identical prompt	860 ms → 1 ms · 860× speedup	API charges full price every call
D	Evidence per answer	DAG node + organ chain + gate result	API returns a string

Production state

Working — and not yet working.

Working

C++ runtime · native CUDA forward · single binary.
Production code-skeleton organ · MBPP 13/100, HE 6/164, LCB 0/50 · anchor 19/19 · frozen.
Production TRIZ organ · ARIZ 88/100 strict · fallback 0.
BD9 phys05_json_repair · 10 / 10 GREEN on production failure catalog.
BD9 phys05_claim_extractor · GREEN · clean structured-JSON output.
Production grew from 2 to 5 surgered organs in one BD9 session.
Q4 Physarium-7B · 83.58 tok/s production (llama.cpp) · 18.27 tok/s native default · 28.99 tok/s with DP4A flag · 5.55 GB VRAM · RTX 3060 Ti.
Memory spine indexed · 305 files · 58 996 lines.
Acceptance suite 18/18 · identity probe 14/14.
Hologram cache · 860 ms → 1 ms on identical input (860×).
Repeat-learning round 2 · MBPP-20 · 13/20 vs PARROT 12/20.
Terminal-NanoOS-30 · MONSTER 22/30 vs PARROT 20/30 (+2).

Not yet at production gate

BD8 critic + wound rescue rate · 0/n on ARIZ JSON · wound v2 retained for in-chat path · BD8 retraining queued.
BD9 phys05_test_writer YELLOW · pytest shape correct, semantics drift.
BD9 phys05_cache_matcher YELLOW · correct integer + post-answer drift.
BD9 phys05_renderer RED · output corrupted on free-form bash · queued BD9.1 (50+ rows or r=16).
Black-Dog conductance arbitration in Python harness only · C++ port queued.
Critic + wound for non-terminal routes.
Memory exact-lookup CLI · not built.
Memory TF-IDF semantic ranker · not built.
GPQA Diamond runner · gated on dataset auth.
SWE-bench Lite runner · gated on Phase-12 NanoOS shell.
BFCL 3-mode runner · partial harness.
Routing-field heuristic (paused).

Acceptance Integrity Ladder

Nine runs. Eight pass 18/18; the opt-in DP4A run (v15) scores 17/18.

The 18-task curated acceptance suite is the integrity gate every change has to pass. It covers the identity probe, the organ-leaks gate, the structural verifier, and the end-to-end JSON / code / claim / memory routes. No surgery is merged if the suite regresses.

Run	Result	Identity	Wall	Note
v14 · llama.cpp backend	18/18	14/14	2.99 s	production ceiling · CURRENT_TRUTH_LEDGER §1
v15 · DP4A native	17/18	—	—	opt-in flag · close to 18/18 with G2.b/G3/G4 fixes
v16 · after Gap C close	18/18	—	—	GAP_C_KILL_FINAL.md
v17 · llamacpp after Gap C	18/18	—	—	—
v18 · after G3 (Python compile probe)	18/18	—	—	verifier hardening
v19 · after holographic form replay	18/18	—	—	HOLOGRAPHIC_FORM_REPLAY_V1.md
v20 · after native CR	18/18	—	—	code-repair loop in native runtime
v21 · anchored preamble	18/18	—	—	identity anchor in system message
v22 · post anchor	18/18	—	—	stable post-9F LoRA anchor

v14 confirmed in CURRENT_TRUTH_LEDGER §1. v15 → v22 trajectory documented in HISTORY_TREE.md §5/§6; per-run JSONs preserved in master logs. Headline: every shipped change passed acceptance.

Reports

Every claim has a file behind it.

Category	Report	Summary
Roadmap	FRANKENLLM_ROADMAP_STATUS_V1.md	master 8-track status
Truth Ledger	CURRENT_TRUTH_LEDGER.md	single source of truth
BD7	BD7_TRIZ_SURGERY_FINAL.md	0 → 88/100 strict 6-field JSON
BD9	BD9_JSON_REPAIR_FINAL.md	phys05_json_repair 10/10 GREEN on production failure catalog
BD9	BD9_FOUR_ORGANS_FINAL.md	four-organ sweep, production grew 2 → 5 organs
Bench	MBPP_HE_3MODE_V1.md / .json	MBPP 13/100 · HE 6/164 (Mode-B, no fallback)
Bench	SOVEREIGN_WIN_REPORT_V2.md	4-axis honest comparison vs PARROT
Bench	SOVEREIGN_COGNITION_GAUNTLET_V1.md	RED 59/60 vs PARROT 60/60 · honest
Bench	TERMINAL_NANOOS_30.md	MONSTER 22/30 vs PARROT 20/30
Bench	EXACT_REPLAY_CACHE_V1.md	860 ms → 1 ms cache hit
Bench	MBPP_REPEAT_LEARNING_V1.md	round 2 · MRL ≥ PARROT
Runtime	RUNTIME_ORGANISM_BENCH_V1.md	BD2 + BD4 first integration
Audit	GIGACHAD_SYSTEM_INTEGRITY_AUDIT.md	10-layer connectivity check
Audit	CLEAN_ROOM_DOCTRINE.md	externals are autopsy specimens, never spine

All reports live in /home/pc/gigachad_native/reports/. Numbers on this page point back to them.

Principles

Five rules that govern the organism.

01

Organs must earn their place.

A specialist model that is never called is dead. A specialist that does not improve score, latency, or reliability is tissue for the next surgery pass. Organ liveness is a hard requirement, not a preference.

02

The brain is last, not first.

The top model should not handle every task by default. Lower organs attempt narrow work first. The brain synthesizes, judges, or rescues only when organs cannot carry the load. Brain-first routing is the failure mode.

03

Failure is not waste.

Every failed output is a training example waiting to be harvested. The verifier produces the wound. The Black Dog marks it as poison. Surgery turns it into a stronger organ. This cycle is the organism's immune system.

04

No proof, no claim.

An answer without a verifier trace is only text. Frankenstellm is built to return not only an output, but the evidence that produced it. An unverified output is raw material, not a result.

05

Native is the body.

Research tooling can use Python. The live organism cannot depend on it. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. Compile what you ship.

A stitched intelligence. Built from operated models.
