The world has produced extraordinary open foundation models, and almost no rigorous practice for shaping them after release. Our laboratory exists to close that gap.

Our active patients are open foundation models — Qwen 2.5 0.5B as the lower-organ donor and Qwen 2.5 7B as the top-brain — alongside an autopsy archive of larger artefacts including DeepSeek V4-Flash. The standard answer to making any such model behave the way you need it to is fine-tuning — and fine-tuning, in practice, is hope dressed in machinery. You feed in data; you wait; you ship whatever comes out; you cannot say with certainty what you changed.

Surgery operates differently. We treat the weight space as anatomy. We open the model up, we identify the structures responsible for a behaviour we want to change, and we modify them — locally, measurably, reversibly. Where the field at large grows models by accumulation, we refine them by intervention.

The work spans every layer of the stack. The compression formats that let very large models run on small machines. The native compiled runtime that replaces the research-grade scripting most laboratories ship with. The internal memory that gives a refined model something resembling continuity. The verifier that tells us, on every output, whether the answer is supported. None of this exists in isolation. All of it serves the same goal: to produce systems whose behaviour we understand and whose claims we can defend.

The Stack

Six systems. One operating theatre.

01

The Brain

A central reasoning model — refined, not retrained.

At the centre of every system we ship sits a single high-capacity reasoning model, derived from an open foundation through targeted intervention. It carries the breadth of a frontier model and the discipline of a system whose behaviour was shaped one circuit at a time.

Class: Top-level reasoning · Origin: Open foundation

02

The Organs

A farm of small specialists.

Around the central model sits a fleet of compact specialists — each one a sub-billion-parameter expert at a single narrow function: structured output, code skeleton, claim extraction, contradiction analysis. They run cheaply, in parallel, and answer to the model above them.

Population: 5+ specialists · Each: < 1B parameters

03

The Memory

A structured spine of persistent recall.

A reasoning system without memory begins every encounter from zero. Ours does not. A structured archive of hundreds of indexed volumes, addressable down to the line, gives the assembled system something other models lack: a continuous record of what it has thought, what it has been told, and where each fact came from.

Spine: 305 files · 58 996 lines · Address: sha256[:16] per line

04

The Bloodstream

A routing field that learns its own paths.

Every request flows through a self-organising routing layer that decides which organs to wake, which memories to retrieve, and which paths to reinforce or starve. It is, in effect, a circulatory system — quiet, adaptive, and the reason the assembled body responds as one.

Substrate: Adaptive routing · Property: Self-pruning

05

The Verifier

A hard gate against fabrication.

No claim leaves the system without passing a strict verifier. Every assertion that references memory must carry a pointer to its source. Every output that cannot be backed up is flagged as such, in plain language, before it reaches the user. The default, in our laboratory, is suspicion.

Output: Source-pointed · Default: Skeptical

06

The Body

A native compiled runtime.

Most laboratories ship Python. We ship a compiled native runtime. Memory, model loading, attention kernels, tiering between fast and slow storage — all of it written in low-level systems code. The result is a complete model deployed on a single consumer GPU: less hardware, less latency, no scripting language between the model and the machine.

Language: Native compiled · Target: Single consumer GPU

Selected Work

What this laboratory has done.

CASE 01 · HISTORICAL

An open frontier-grade model, taken apart on a single consumer GPU.

In the V4-Flash autopsy line (Q1 2026), we took DeepSeek V4-Flash — an open-weight Mixture-of-Experts foundation model with 284 billion parameters in total, 13 billion active per token, a one-million-token context window, an MIT license, and an inference profile that ordinarily requires multiple data-centre-class GPUs to operate — and we drove it through end-to-end inference on a single consumer GPU.

Measured: 1.86 tok/s decode, 2.08 sec/tok on the 18-layer text decode after expert-streaming repack, with 89 % of wall time in expert I/O. The artefact taught us that the bottleneck is not arithmetic: it is the choreography of moving specialised experts in and out of working memory. Currently active: our working Python inference reference now produces the correct Paris top-1 answer with a +11.13 logit margin on the same RTX 3060 Ti, and is released as an open-source pipeline with 8 documented architectural findings.

Subject: DeepSeek V4-Flash

Total / active: 284 B / 13 B

Paris top-1 logit: +40.75 (margin +11.13)

Status: Python reference active · native port in progress

CASE 02

A small specialist with a fifth of its weight removed — measured, not assumed.

Targeted excision on Qwen 2.5 0.5B (Physarum-05B donor tissue): 14.94 % global / ≈ 22 % per-tensor weights zeroed, guided by an internal signal that identifies parameters with no measurable contribution. The behaviour delta was not invisible. Perplexity rose 12.5 %; MMLU-mini fell 22 %; GSM8K-mini fell 20 %. The JSON-schema and code-skeleton smoke tests survived. Throughput was preserved.

This was the proof of principle, not the proof of victory. A model is not a single inseparable thing — but it is also not a free dinner. Healthy and dead tissue can be distinguished; what survives the operation depends on the gate you used. We publish the deltas instead of hiding them, because the trajectory matters more than any one pass.
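
The global and per-tensor figures measure different things. One way they can diverge is when the excision touches only some tensors: the per-tensor rate is counted inside the touched tensors, while the global rate divides by every parameter in the model. A minimal sketch of that arithmetic follows; the tensor names and sizes are hypothetical and are not the real Qwen 2.5 0.5B layout.

```python
# Hypothetical tensors: (name, parameter count, fraction zeroed).
# Names and sizes are illustrative only, not the real model layout.
TENSORS = [
    ("embed", 1_000_000, 0.00),     # left untouched by the excision
    ("mlp.up", 1_000_000, 0.22),    # pruned at the per-tensor rate
    ("mlp.down", 1_000_000, 0.22),  # pruned at the per-tensor rate
]


def global_sparsity(tensors):
    """Zeroed weights divided by ALL weights in the model."""
    zeroed = sum(count * frac for _, count, frac in tensors)
    total = sum(count for _, count, _ in tensors)
    return zeroed / total


def touched_sparsity(tensors):
    """Zeroed weights divided by weights in tensors that were pruned at all."""
    touched = [(count, frac) for _, count, frac in tensors if frac > 0]
    zeroed = sum(count * frac for count, frac in touched)
    return zeroed / sum(count for count, _ in touched)


print(f"global:  {global_sparsity(TENSORS):.4f}")
print(f"touched: {touched_sparsity(TENSORS):.4f}")
```

With these made-up sizes the touched tensors sit at 22 % while the model as a whole sits lower, which is the same shape of gap as the two reported numbers.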

Donor: Qwen 2.5 0.5B

Weights zeroed: 14.94 % global / ≈ 22 % per-tensor

PPL drift: +12.5 %

MMLU-mini / GSM8K-mini: −22 % / −20 %

CASE 03

A custom expert-streaming format that gave the inference loop a six-fold speed-up.

The single largest cost in operating a model the size of DeepSeek V4-Flash on consumer hardware is not arithmetic. It is the choreography of moving the model's hundreds of specialised experts in and out of working memory, again and again, on every forward pass.

By reorganising the way these specialised parts are packed and streamed from disk, we turned the most expensive operation in the inference pipeline into a tractable one. The same model, on the same hardware, ran roughly six times faster on its decode loop. The format is internal, instrumented, and reproducible. We use it now as the substrate for everything else.

Bottleneck: Expert streaming

Speed-up: ≈ 6 ×

Hardware: Unchanged

Status: In production

CASE 04

A line-addressable memory spine, indexed end to end.

The current memory spine is a structured archive of 305 source files / 58 996 lines: organ-surgery transcripts, docs, reports. Every line carries a sha256[:16] address; the manifest lives at data/memory_spine/manifest_v1.jsonl. Breakdown: 45 organ-surgery files (10 863 lines), 20 docs (4 783 lines), 240 reports (43 350 lines).

Indexing is shipped. Exact-lookup CLI and TF-IDF semantic ranker are queued, not done — we say so explicitly. It is not a chat history and it is not a vector store. It is the spine on which exact-recall and contradiction-detection will sit.
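
To make the per-line address concrete, here is a minimal sketch of how a `sha256[:16]` address could be derived and collected into manifest rows. What exactly gets hashed is not stated in this document; the `path:lineno:text` key, the function names, and the row fields are assumptions for illustration, not the format of manifest_v1.jsonl.

```python
import hashlib
import json


def line_address(path: str, lineno: int, text: str) -> str:
    """Return a 16-hex-character (64-bit) address for one line.

    Assumption: the hashed key is "path:lineno:text". The real spine may
    hash something else, for example the line text alone.
    """
    key = f"{path}:{lineno}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def manifest_rows(path: str, lines):
    """Yield one JSON-serialisable row per line (hypothetical row layout)."""
    for lineno, text in enumerate(lines, start=1):
        yield {"addr": line_address(path, lineno, text),
               "file": path, "line": lineno}


rows = list(manifest_rows("reports/example.md", ["first line", "second line"]))
for row in rows:
    print(json.dumps(row))
```

Including the path and line number in the key keeps identical lines in different files from colliding, at the cost of changing every address below an inserted line; hashing the text alone makes the opposite trade.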

Files: 305

Lines: 58 996

Address: sha256[:16] per line

Lookup CLI: queued

On the Table

Patients currently in the laboratory.

Each patient that enters the runtime carries four artefacts: a baseline number, a post-surgery delta, a verifier-checked anchor set, and a frozen pack hash. No merge without those four.
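
The four-artefact rule is simple enough to express as a merge gate. The sketch below is one possible shape for it; the class, field, and function names are hypothetical and only the four required artefacts come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatientRecord:
    # Field names are hypothetical; the four slots mirror the four artefacts.
    name: str
    baseline: Optional[float] = None            # baseline number
    post_surgery_delta: Optional[float] = None  # delta after the operation
    anchor_set_passed: Optional[bool] = None    # verifier-checked anchor set
    pack_sha256: Optional[str] = None           # frozen pack hash


def missing_artefacts(rec: PatientRecord) -> List[str]:
    """List which of the four artefacts are absent or failing."""
    missing = []
    if rec.baseline is None:
        missing.append("baseline number")
    if rec.post_surgery_delta is None:
        missing.append("post-surgery delta")
    if rec.anchor_set_passed is not True:
        missing.append("verifier-checked anchor set")
    if not rec.pack_sha256:
        missing.append("frozen pack hash")
    return missing


def can_merge(rec: PatientRecord) -> bool:
    """No merge unless all four artefacts are present."""
    return not missing_artefacts(rec)


draft = PatientRecord(name="example-organ", baseline=27.15)
print(can_merge(draft), missing_artefacts(draft))
```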

Patient	Size	Role	Status
SmolLM2-135M (NanoAgent)	135 M / 270 MB f16 · Apache 2.0	agentic tool-call organ · fast local dispatch	in production · 0 % → 74 % tool-call accuracy · identity edited via weight surgery (ROME/MEMIT), not training
Qwen 2.5 0.5B (Physarum-05B)	0.5 B / 988 MB · Apache 2.0	lower-organ base	in production · multiple specialised packs
Qwen 2.5 7B (Physarium-7B Q4)	7 B / 5.55 GB Q4 · Apache 2.0	top-brain · 7B fallback	in production · 83.58 tok/s (llama.cpp) · 18.27 tok/s native default · RTX 3060 Ti
DeepSeek V4-Flash	284 B total / 13 B active · MoE · 1 M context · MIT	frontier MoE host · expert-streaming · native inference engine	Python reference active · Paris top-1 +11.13 logit margin · native engine in active development
DeepSeek V4-Pro	1.6 T total / 49 B active · MoE · 1 M context · MIT	frontier reference · agentic coding (SWE-Bench frontier tie)	instrumented for runtime study · architectural diff vs V4-Flash captured
Qwen 3.5 small (0.6B · 1.7B · 4B)	0.6–4 B · Apache 2.0	next-gen organ base · replaces Qwen 2.5 0.5B donor line	queued · BD10 sweep
Qwen 3.5 mid (8B · 14B · 32B)	8–32 B · Apache 2.0	top-brain candidate · 7B fallback successor · 32B = best small dense coder	queued · BD11 · scoping vs Physarium-7B
Qwen 3.6-27B-Coder	27 B dense · Apache 2.0	code-organ specialist · best small dense coder under Apache-2.0	queued · code-organ BD candidate
Qwen 3-235B-A22B	235 B total / 22 B active · MoE · Apache 2.0	flagship MoE organ host · safest enterprise license · expert-streaming target	queued · MoE BD12 candidate
Llama 4 Scout	109 B total / 17 B active · MoE · 10 M context · Llama license	long-context multi-organ scaffold candidate	scoping · license review pending
Llama 4 Maverick	400 B total / 17 B active · MoE · Llama license	frontier-tier MoE candidate · larger experts vs Scout	scoping · expert-streaming bench candidate
Gemma 4 E2B / E4B	2 B / 4 B effective · Apache 2.0	edge / mobile organ class	queued · multimodal scoping
Gemma 4 26B MoE	26 B / 3.8 B active · MoE · Apache 2.0	MoE-organ research host · pair with V4-Flash native work	queued · expert-streaming bench candidate
Gemma 4 31B Dense	31 B · Apache 2.0	top-brain candidate · 70B-class behaviour at consumer-VRAM cost	queued · vs Qwen 3.5 32B head-to-head
Phi-4 Reasoning Plus	14 B · MIT	reasoning organ · rivals far larger models on complex reasoning	queued · reasoning-organ BD candidate
Phi-4 Mini	3.8 B · 128 K context · MIT	edge organ · long-context on resource-constrained hardware	queued · edge-organ candidate
Mistral Large 3	flagship dense · Apache 2.0	top-brain candidate · open re-licensing of Mistral flagship	scoping
Mistral Small 4	24 B · Apache 2.0	efficient dense organ · code + general	queued · vs Qwen 3.5 32B
OLMo 2	7 B / 13 B · fully open (data + checkpoints + logs) · Apache 2.0	research-grade donor · only family with reproducible training trace	scoping · ideal for surgery instrumentation
Kimi K2.6 (Moonshot)	open-weights	frontier candidate · #1 Artificial Analysis Index (54) · long-context	scoping · license review
GLM-5.1 (Z.ai / Zhipu)	open-weights · MIT	multilingual organ · cleanest MIT license among Chinese frontier opens	scoping
Nous Hermes 4	Llama 4 base · community fine-tune	chat / instruction / persona-tuned organ candidate	scoping
DeepSeek-R1-Distill-Qwen-1.5B	1.5 B · MIT	reasoning-organ candidate	scoping

Native Speed Ladder

From 1.91 tok/s to 83.58 tok/s.

Every GPU step measured on the same RTX 3060 Ti. The 7B rows use Physarium-7B Q4; the 0.5B and V4-Flash rows are included as reference points. Each row is a milestone in the native runtime; each number has a report file. The ladder is the headline arc of Phase 6 → Phase 13.

Phase / Configuration	Speed	vs prev	Note
V4-Flash 284B PyTorch warm decode	p50 9.6 s/tok	—	flagship demo · 8 GB VRAM
Physarum-05B-Organic baseline	27.15 tok/s	—	0.5B BF16 baseline
CPU baseline · 0.5B	1.91 tok/s	—	reference floor
CUDA full GPU 0.5B (Phase 8E.1)	116 tok/s	61× CPU	byte-identical
CUDA fused 7B BF16 streaming (8E.2)	0.20 tok/s	—	correctness proof, not main path
Q4 NUCLEAR resident 7B (8E2)	11.16 tok/s	280× CPU baseline	5.55 GB Q4 group=128 · 28 layers in VRAM
Q4 native v2 default `--chat`	18.27 tok/s	+64 % vs NUCLEAR	—
Q4 native + DP4A=1 (opt-in)	28.99 tok/s	+59 %	—
Q4 native + DP4A · tg128	41.69 tok/s	+44 %	—
llama.cpp backend (LLAMACPP_URL)	83.58 tok/s	+100 %	production · clean-room autopsy
Mode C llama.cpp acceptance · mean wall	2.99 s	—	per query, 18-task suite

sources: reports/EXTERNAL_BACKEND_SHOOTOUT_V2.md · reports/PHASE_8E8A_DP4A_NATIVE_BACKEND.md · reports/CURRENT_TRUTH_LEDGER.md §2 · 5-run mean on RTX 3060 Ti
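
The "vs prev" percentages on the 7B rows can be recomputed from the speeds in the table. The check below assumes each percentage is the ratio to the row directly above it, rounded to the nearest whole percent; the variable and function names are ours.

```python
# Speeds in tok/s copied from the 7B rows of the ladder above.
LADDER_7B = [
    ("Q4 NUCLEAR resident 7B", 11.16),
    ("Q4 native v2 default", 18.27),
    ("Q4 native + DP4A=1", 28.99),
    ("Q4 native + DP4A, tg128", 41.69),
    ("llama.cpp backend", 83.58),
]


def pct_vs_prev(prev: float, cur: float) -> int:
    """Percentage gain of cur over prev, rounded to a whole percent."""
    return round((cur / prev - 1.0) * 100)


gains = [
    (name, pct_vs_prev(LADDER_7B[i - 1][1], speed))
    for i, (name, speed) in enumerate(LADDER_7B)
    if i > 0
]
for name, gain in gains:
    print(f"{name}: +{gain} %")

# The 0.5B GPU row is stated as a multiple of the 0.5B CPU baseline instead.
cpu_multiple = round(116 / 1.91)
```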

Surgery Cycle Ledger

The trajectory matters more than any one pass.

Eight follow-up surgery passes were reverted before the pass-1 code-skeleton organ was frozen as the production pack. We publish the trajectory because failed passes are the proof that the gate doctrine is real, not a slogan.

Pass	Lever / variant	Outcome
BD6 pass-1	physarum05b_code_skeleton.planck	KEPT · production · 13/100 MBPP, 6/164 HE, anchor 19/19
BD6.2	retraining	REVERTED · overtrain, MBPP regressed 13 → 6
BD6.3	anchor gate	REVERTED · failed gate
BD6.4	anchor positive partial	REVERTED · partial
BD6.5	stratified poison 15/19	REVERTED · 13/19 anchor
BD6.6	stratified mix v6	REVERTED · over-anchor regression
BD6.7	KL-anchor ladder λ=0.10/0.20	REVERTED · no lift
BD6.8D	token-weighted CE	REVERTED · no lift
BD6.8D2	per-bench poison + asymmetric holdout	REVERTED · over-tuned
BD6.8D-rank	r=16 / α=32	FREEZE DECISION · ship pass-1, freeze BD6.x
BD7	triz_contradiction_v2.planck	KEPT · 88/100 strict 6-field JSON, fallback 0
BD8 V1–V5	critic_lite + wound (ARIZ rescue path)	BLOCKED · rescue 0/n on ARIZ JSON · wound v2 retained for in-chat rescue
BD9	phys05_json_repair	KEPT · 10 / 10 GREEN on production failure catalog · loss 0.055 → 0.0003 · 280 rows
BD9	phys05_claim_extractor	KEPT · GREEN · clean structured JSON · loss 0.51 → 0.04 · 25 hand-curated rows
BD9	phys05_test_writer	YELLOW · pytest shape correct, semantics drift (currying confusion + Human-token leak)
BD9	phys05_cache_matcher	YELLOW · correct integer + post-answer drift · runtime regex extracts head
BD9	phys05_renderer	RED · output corrupted · loss did not converge (0.69 ceiling on 25 rows) · queued BD9.1

NanoAgent — 135 M patient

Eleven passes, three explicit reverts, one identity done by editing, not training.

SmolLM2-135M-Instruct, untouched, scores 0/30 on real agentic tool-calls — it was never shown the shape of that job. Ten training passes and two direct weight-edit rounds later it holds 74 % on the same test, against 70 % for Qwen3.5-2B (15× the size) and 93 % for Gemma4-E2B (38× the size) — at 89 tok/s on an RTX 3060 Ti and 63 tok/s on CPU alone, no GPU at all.

Pass	Lever / variant	Outcome
v1–v4	QLoRA agentic base · 7B-teacher distillation · reasoning layer	KEPT · foundation · order-DB + tool-call schema learned from zero
franken50	`peft.add_weighted_adapter` — literal v3+v4 weight merge	KEPT · single checkpoint carrying both skill sets
v5	date-reasoning patch, no replay anchors	REVERTED · silently regressed task-dependency reasoning (5/5 identical wrong answers at temp 0)
v6	date patch + replay anchors from every prior skill	KEPT · fixed v5's regression, replay-anchor discipline adopted from here on
v7	Russian-language patch, 15 examples + anchors	REVERTED · echo-collapse — model repeated input verbatim instead of answering
v8	identity (name) + graceful uncertainty + multi-turn memory, dedicated training	REVERTED · identity failed to stick even on the exact trained question · date-reasoning regression despite an anchor present
v9	paraphrase robustness for tool-selection	KEPT · natural-phrasing tool accuracy 50 % → 67 %
v10	root-caused train/inference chat-template mismatch + full day-of-week coverage + harvested hard negatives	KEPT · production base · 58 % → 74 % tool accuracy · day-of-week 0/7 → 5/7
v11	Gemma4-E2B offline distillation for general Q&A	REVERTED as strategy · "the model should reason from its own knowledge, not borrow ours" · kept only as a benchmark reference, never shipped
self-reflect	draft → self-critique → revise loop, replacing an earlier best-of-N self-judge	KEPT · +0.25 to +0.66 quality (5-pt scale) · the self-judge variant it replaced measurably made output worse and was deleted, not disabled
ROME v1	single rank-1 edit, layer 29, "My name is ___" pattern only	REVERTED · 2/7 real-world phrasings — model still opened most replies with the wrong name before the edited continuation ever fired
MEMIT v1	multi-key ridge-regularised edit, single layer, 16 phrasings jointly	REVERTED · 27/27 in float32, but several cases silently flipped back to the wrong name once quantized to f16 GGUF — the actual serving format
MEMIT v2	two-pass edit across 5 layers — one pass for the "name is ___" continuation, one for the actual first generated token, jointly fit against 27 phrasings	KEPT · production · 15/17 on the real quantized runtime, incl. phrasings never trained on · reasoning & tool-call output byte-identical before/after — the edit never touched anything outside the identity circuit

tool-call accuracy measured on a fixed 30-case natural-language suite, identical runtime/grammar/dispatcher across every row · speed measured live on an idle RTX 3060 Ti (GPU) and a 12-thread i5-12400F (CPU-only, no offload)
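
For readers unfamiliar with rank-one weight editing, the sketch below shows the core idea in plain Python: change a weight matrix so that one chosen key vector maps to one chosen value vector, then re-check the weights after rounding to half precision. It is a simplification and not the laboratory's editing code: real ROME weights the key by an inverse covariance estimate, whereas this sketch uses the identity, and all names and numbers here are made up for illustration.

```python
import struct


def matvec(W, k):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k == v.

    Simplification: ROME proper replaces k^T with (C^-1 k)^T for a key
    covariance C; here C is taken to be the identity.
    """
    kk = sum(x * x for x in k)
    residual = [vi - yi for vi, yi in zip(v, matvec(W, k))]
    return [[w + r * x / kk for w, x in zip(row, k)]
            for row, r in zip(W, residual)]


def to_f16(x: float) -> float:
    """Round a float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def drift_after_f16(W, k, v) -> float:
    """Largest error in W k versus v once W is rounded to half precision."""
    W16 = [[to_f16(w) for w in row] for row in W]
    return max(abs(a - b) for a, b in zip(matvec(W16, k), v))


W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 2.0]    # hypothetical key activation
v = [3.0, -1.0]   # hypothetical target output
W_edit = rank_one_edit(W, k, v)
print(matvec(W_edit, k), drift_after_f16(W_edit, k, v))

# Half precision has about 3 decimal digits near 1.0, so a small enough
# weight change disappears entirely when the checkpoint is converted.
assert to_f16(1.0 + 1e-4) == 1.0
```

The last line is the reason an edit that passes in full precision still needs to be re-verified in the serving format.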

Open-source reference

DeepSeek V4-Flash — working Python pipeline.

We are releasing the full Python reference pipeline for DeepSeek-V4-Flash inference — the same code that produces the correct Paris top-1 answer with a +11.13 logit margin on a single RTX 3060 Ti via WSL2.

This is the reference oracle: anyone porting V4-Flash to a different runtime (CUDA, Triton, MLX, Rust, anything) can use it to cross-check activations layer-by-layer with compare_dumps.py. Architecture nuances that broke our own ports are documented openly in PYTHON_PIPELINE_DOC.md so nobody has to rediscover them.
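
As a rough picture of what a layer-by-layer cross-check involves, the sketch below compares two sets of activations and reports the first layer where a port drifts from the reference. It is not the code of compare_dumps.py: the in-memory format (one flat list of floats per layer), the tolerance, and the function names are assumptions for illustration.

```python
import math


def compare_activations(ref, test, rel_tol=1e-2):
    """Return (max_abs_diff, cosine_similarity, ok) for two flat vectors."""
    if not ref or len(ref) != len(test):
        raise ValueError("vectors must be non-empty and of equal length")
    max_abs = max(abs(a - b) for a, b in zip(ref, test))
    dot = sum(a * b for a, b in zip(ref, test))
    norm_ref = math.sqrt(sum(a * a for a in ref))
    norm_test = math.sqrt(sum(b * b for b in test))
    cos = dot / (norm_ref * norm_test) if norm_ref and norm_test else 0.0
    # Scale the tolerance by the reference magnitude so that large
    # activations are not flagged for proportionally small errors.
    scale = max(abs(a) for a in ref) or 1.0
    return max_abs, cos, max_abs <= rel_tol * scale


def first_divergent_layer(ref_layers, test_layers, rel_tol=1e-2):
    """Index of the first layer that fails the check, or None if all pass."""
    for idx, (ref, test) in enumerate(zip(ref_layers, test_layers)):
        if not compare_activations(ref, test, rel_tol)[2]:
            return idx
    return None


reference = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
port = [[1.0, 2.0, 3.0], [4.0, 5.0, 9.0]]
print(first_divergent_layer(reference, port))
```

Reporting the first divergent layer, rather than only the final logits, is what makes such a check useful for porting: an error introduced early is otherwise smeared across every later layer.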

License: MIT · Model weights: DeepSeek-AI MIT · Verified 2026-05-31

Download

📦 tar.gz · 26 KB

python_v4_paris_pipeline.tar.gz

7 files: flash_mvp.py · flash_mvp_chat.py · kernel_pytorch.py · dump_ref_v4.py · compare_dumps.py · fht_fallback.py · doc

sha256 3116b742…d16a5fdf

📄 markdown · 11 KB

PYTHON_PIPELINE_DOC.md

Files & roles, run instructions, 8 architecture findings, per-layer activation reference, numerical validation framework, handoff checklist.

8 architecture findings that broke our own ports

Reading model.py isn't enough — the model gives garbage without each of these. Documented openly so nobody has to rediscover them.

#	Subsystem	What's easy to get wrong
1	Chat template	Without the trailing `<｜Assistant｜></think>` wrapper, the model emits garbage. Non-thinking mode is encoded by that closing `</think>`.
2	Hash routing computes `original_scores`	Layers 0/1/2 route via `tid2eid`, but weights still come from `sqrt(softplus(x · Wᵀ))` followed by gather + normalisation + scaling factor 1.5. Naive uniform `1/top_k` weights destroy magnitudes.
3	`act_quant` double-apply trap	For GEMM input, `act_quant` returns `(y_raw, scale)` and the GEMM applies scale internally. For inplace simulation, it writes back `y · scale`. Confusing them = silent 1000× underflow in MoE.
4	Compressor overlap a/b split	CSA (m=4) compressor weights split `wkv`, `wgate`, `ape` into "a-portion" (previous chunk in overlap window) and "b-portion" (current chunk). HCA (m=128) has no overlap → single weight.
5	mHC `C_l = 2 · sigmoid`	Output mapping for Manifold-Constrained Hyper-Connections is 2 × sigmoid, with range (0, 2); the factor 2 is critical and easy to miss.
6	SwiGLU asymmetric clamp	Up-component clamped to `[-10, +10]`. Gate component capped at `+10` only — no lower cap. The asymmetry is real and intended.
7	Inverse RoPE on attention output	After attention, before the grouped output projection, apply inverse RoPE to the last 64 dims of the attention output ("Partial RoPE", paper §2.3.3). Missing this destroys long-context coherence.
8	Per-head Q RMS without learned weight	After `wq_a → q_norm → wq_b`, apply per-head RMS with no learned weight: `q *= rsqrt(q.square().mean(-1, keepdim=True) + eps)`. Skipping it silently degrades head specialisation.
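
Findings 5, 6 and 8 are small element-wise rules, so they can be pinned down in isolation. The sketch below is plain Python rather than the pipeline's PyTorch code. The function names, the `eps` default, and the choice to clamp before applying the activation are assumptions for illustration and are not taken from kernel_pytorch.py.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mhc_output_map(x: float) -> float:
    """Finding 5: the output mapping is 2 * sigmoid(x), not sigmoid(x)."""
    return 2.0 * sigmoid(x)


def swiglu_clamped(gate: float, up: float, limit: float = 10.0) -> float:
    """Finding 6: up is clamped on both sides, gate only from above."""
    up = max(-limit, min(limit, up))  # symmetric clamp
    gate = min(gate, limit)           # upper cap only, no lower cap
    return gate * sigmoid(gate) * up  # SiLU(gate) * up


def per_head_rms(q_heads, eps: float = 1e-6):
    """Finding 8: normalise each head by its own RMS, no learned weight."""
    out = []
    for head in q_heads:
        inv = 1.0 / math.sqrt(sum(x * x for x in head) / len(head) + eps)
        out.append([x * inv for x in head])
    return out


print(mhc_output_map(0.0))
print(swiglu_clamped(20.0, 50.0), swiglu_clamped(-20.0, 1.0))
print(per_head_rms([[3.0, 4.0], [0.5, 0.5]]))
```

Note that each head in `per_head_rms` is reduced over its own last dimension only, which is the job `keepdim=True` does in the tensor version.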

Full algebraic derivations + per-layer activation reference (21 layer-0 dumps + 9 deeper layer inputs + final logits + pre-head) in PYTHON_PIPELINE_DOC.md §4 and §5.
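
As a rough illustration of finding 2, the sketch below computes routing weights for experts that were already selected by a lookup table, and contrasts them with the naive uniform choice. It is plain Python, not the pipeline's code. The function names are ours, and normalising by the sum of the gathered scores is an assumption about what "normalisation" means here; PYTHON_PIPELINE_DOC.md remains the reference.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def routing_weights(router_logits, selected, scale: float = 1.5):
    """Weights for pre-selected experts.

    router_logits: one value of x . W^T per expert.
    selected: expert ids chosen elsewhere (for example by a token-id table).
    """
    scores = [math.sqrt(softplus(z)) for z in router_logits]  # all experts
    picked = [scores[e] for e in selected]                    # gather
    total = sum(picked)
    return [scale * s / total for s in picked]                # normalise, scale


def naive_uniform_weights(selected):
    """The tempting shortcut that finding 2 warns against."""
    return [1.0 / len(selected)] * len(selected)


logits = [0.0, 2.0, -1.0, 4.0]  # made-up router outputs
chosen = [1, 3]                 # made-up expert selection
print(routing_weights(logits, chosen))
print(naive_uniform_weights(chosen))
```

Under these assumptions the two differ both in total mass (1.5 versus 1.0) and in how that mass is split between experts.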

Reproduce Paris in 4 commands

tar xzf python_v4_paris_pipeline.tar.gz && cd v4_pipeline
pip install torch transformers safetensors numpy
export V4_MODEL=/path/to/v4_original   # HF checkpoint with safetensors + inference/
python3 flash_mvp_chat.py --ckpt "$V4_MODEL" \
    --user "What is the capital of France? Answer in one word." \
    --mode chat --max-seq 64
# → predicted token 51119 ('Paris'), logit +40.75

Generate reference dumps for porting validation: REF_N_LAYERS=43 python3 dump_ref_v4.py (~11 min on RTX 3060 Ti).

Reports

Selected publications.

Truth Ledger

CURRENT_TRUTH_LEDGER — what was measured vs what was claimed (single source of truth).

reports/CURRENT_TRUTH_LEDGER.md · 2026-05-01

Surgery Ledger

BD6 trajectory — code-skeleton organ frozen at 13/100 MBPP after 8 reverted passes.

reports/BD6_POST_SURGERY_DELTA.md (+ BD6_2…BD6_8D_RANK) · 2026-04

Surgery Report

BD7 TRIZ surgery — 0 → 88 / 100 strict 6-field JSON across seven training stages.

reports/BD7_TRIZ_SURGERY_FINAL.md · 2026-04

Surgery Report

BD9 — phys05_json_repair, 10 / 10 GREEN on production failure catalog (first organ at 100 %).

reports/BD9_JSON_REPAIR_FINAL.md · 2026-05-04

Surgery Sweep

BD9 four-organ sweep: 1 GREEN · 2 YELLOW · 1 RED → production grew from 2 to 5 organs.

reports/BD9_FOUR_ORGANS_FINAL.md · 2026-05-05

Inventory

Memory spine inventory — 305 files / 58 996 lines / sha256[:16] per line.

reports/MEMORY_SPINE_INVENTORY_V1.md · 2026-04

Architecture

Clean-room doctrine — external systems are autopsy specimens, never spine dependencies.

reports/CLEAN_ROOM_DOCTRINE.md · 2026-04

Closeout

2026-04-29 closeout — scored 9-item priority list, including failures.

reports/CLOSEOUT_2026_04_29_FINAL.md · 2026-04-29

See all 66 rendered reports → /r/ · Full 95-report archive → /downloads

Open Tools

Engines we will open up.

Native runtime · gigachad_native

Single C++/CUDA binary. No daemon, no service.

The compiled inference loop. mmaps .planck packs, runs CUDA forward, orchestrates organs. Acceptance suite 18/18 on llama.cpp backend, identity probe 14/14, integrity audit 10/10. Q4 7B at 5.55 GB VRAM, 83.58 tok/s production (llama.cpp) · 18.27 tok/s native default · 28.99 tok/s with DP4A flag · RTX 3060 Ti.

In production

Surgery toolkit · tools/surgery/*

QLoRA drivers, dataset forges, planck repackers.

Failure→repair pair forge, 7B-teacher → student data builders, QLoRA training drivers, adapter merge + planck repack. Every tool produces .planck / .jsonl / .md artefacts; runtime never imports them.

In active use

Memory spine · build_spine_index.py

305 files, 58 996 lines, line-addressable.

Indexing is shipped: every line of the spine has a sha256[:16] address. Exact-lookup CLI and TF-IDF semantic ranker are queued, not done. We say so explicitly.

Indexed; lookup CLI queued

Verifier

Hard checks: JSON schema, code compile, exit code, hash, structured fields.

Hard verifier in place across all production routes. Source-pointer-required gate runs on memory-anchored seeds (14/14 stretch passed); not yet enforced on free-form chat replies. Honest scope.
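
Two of the hard checks named above, strict structured fields and "does the code compile", are easy to sketch with the standard library. This is an illustration and not the laboratory's verifier: the six field names are placeholders (the real 6-field schema is not given here), and the function names are ours.

```python
import json

# Placeholder field names; the real strict 6-field schema is not published here.
REQUIRED_FIELDS = ("field_1", "field_2", "field_3",
                   "field_4", "field_5", "field_6")


def check_strict_json(raw: str, required=REQUIRED_FIELDS):
    """Return (ok, reason). Strict means exactly the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(obj, dict):
        return False, "top level is not an object"
    missing = sorted(set(required) - set(obj))
    extra = sorted(set(obj) - set(required))
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    if extra:
        return False, "unexpected fields: " + ", ".join(extra)
    return True, "ok"


def check_python_compiles(source: str):
    """Return (ok, reason). Parses the code without executing it."""
    try:
        compile(source, "<organ-output>", "exec")
    except (SyntaxError, ValueError) as exc:
        return False, f"does not compile: {exc}"
    return True, "ok"


good = json.dumps({name: "x" for name in REQUIRED_FIELDS})
print(check_strict_json(good))
print(check_strict_json('{"field_1": "x"}'))
print(check_python_compiles("def f(:\n    pass"))
```

Both checks return a plain-language reason alongside the verdict, which fits the rule that unsupported output is flagged to the user rather than silently dropped.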

Hard checks live · source-pointer partial

Principles

How this laboratory operates.

01

Operate, don't retrain.

Where the field grows models by accumulation, we refine them by intervention. Local, targeted, measurable. We change the smallest set of parameters that produces the change we want, and we know which ones.

02

Compile what you ship.

Research code belongs in the laboratory. Production systems belong in compiled native code. The translation is not optional and it is not a future concern. It is the work.

03

Skeptical by default.

No claim leaves the system without a pointer to evidence. No memory is trusted without provenance. No operation is shipped without a reproducible benchmark. The default in this laboratory is doubt.

Honest Current Status

No GREEN without numbers.

Working

C++/CUDA runtime gigachad_native · single binary · single GPU.
Q4 Physarium-7B · 83.58 tok/s production (llama.cpp) · 18.27 native default · 28.99 with DP4A flag · 5.55 GB VRAM · RTX 3060 Ti.
Code-skeleton organ frozen: 13/100 MBPP B · 6/164 HumanEval B · 0/50 LCB · anchor 19/19 · fallback 0 · leaks 0.
TRIZ contradiction organ at 88/100 strict 6-field JSON, fallback 0.
phys05_json_repair (BD9) GREEN · 10 / 10 production failure modes repaired end-to-end.
phys05_claim_extractor (BD9) GREEN · clean structured-JSON output · 25-row training set.
Production state · 5 of 8 organs surgered (was 2 before BD9): code_skeleton · triz · wound v2 · json_repair · claim_extractor.
Acceptance suite 18/18 · identity probe 14/14 · architecture audit 10/10.
Memory spine indexed: 305 files · 58 996 lines · sha256[:16] per line.
Hologram cache 860 ms → 1 ms on identical input (860× speedup).
Repeat-learning round 2 on MBPP-20 ≥ PARROT (13 vs 12).
Terminal-NanoOS 30-task: 22/30 vs PARROT 20/30 (+2).

Not yet at production gate

Black-Dog conductance arbitration in Python harness only · C++ port queued.
Critic + wound rescue rate on ARIZ JSON: 0 across BD8 V1–V5 · wound v2 retained for in-chat rescue · BD8 retraining queued.
phys05_test_writer (BD9) YELLOW · pytest shape correct, currying confusion + Human-token leak · verifier should sanity-check argument count.
phys05_cache_matcher (BD9) YELLOW · correct integer answer + post-answer drift · max_tokens=16 caps the noise.
phys05_renderer (BD9) RED · output corrupted on free-form bash · loss ceiling 0.69 on 25-row training set · BD9.1 queued (50+ rows or r=16).
Memory exact-lookup CLI · not built.
Memory TF-IDF semantic ranker · not built.
GPQA Diamond runner · gated on dataset auth.
SWE-bench Lite runner · gated on Phase-12 NanoOS shell.
BFCL 3-mode runner · partial harness, not at scale.
Sovereign Cognition Gauntlet V1 · MONSTER 59/60 vs PARROT 60/60 · RED, recorded.

Reverted / dead-ends

BD6.2 (overtrain) · MBPP regressed 13 → 6 · reverted, pack pass-1 restored.
BD6.3–6.8D-rank · all failed one of {anchor 19/19, MBPP B ≥13, HE B ≥6, fallback 0} · reverted.
Anchor replication saturates around 53 % · BD6.x ceiling reached.
Topological memory (Phase-5) · A/B/C 200-query test lost to plain jaccard ngram (15.5 % vs 67 %) · demoted to advisory log.
AIME 2024 (full 30) · MONSTER 0/30 (PARROT 1/30) · Δ −1 model-class ceiling.
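
The four BD6 gate conditions listed above translate directly into a check. A sketch, with hypothetical function and argument names; only the thresholds come from the list.

```python
def failed_gates(anchor_pass: int, mbpp_pass: int, he_pass: int,
                 fallback_count: int):
    """Return the BD6 gate conditions a candidate pack fails (empty = keep)."""
    failures = []
    if anchor_pass < 19:
        failures.append("anchor 19/19")
    if mbpp_pass < 13:
        failures.append("MBPP B >= 13")
    if he_pass < 6:
        failures.append("HE B >= 6")
    if fallback_count != 0:
        failures.append("fallback 0")
    return failures


# pass-1, the frozen production pack: 13/100 MBPP, 6/164 HE, anchor 19/19.
print(failed_gates(anchor_pass=19, mbpp_pass=13, he_pass=6, fallback_count=0))

# BD6.2 regressed MBPP from 13 to 6. The other three values are assumed
# unchanged here purely for illustration.
print(failed_gates(anchor_pass=19, mbpp_pass=6, he_pass=6, fallback_count=0))
```

A pack is kept only when the returned list is empty; a single failed condition is enough to revert.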

2026-05-03 · numbers carry a date · reverts stay visible


Run DeepSeek V4-Flash on your GPU.