Surgery Chronology · 2026-04-21 → 2026-05-05
Every claim has a date.
Every number has a file.
From a 284 B-total / 13 B-active MoE running on a single 8 GB GPU, through the first organic weight surgery on Qwen 0.5B, through the Phase 8E speed arc that took us from 1.91 tok/s to 83.58 tok/s, through eight reverted BD6 surgery passes — this page is the unedited record. Reverts stay visible. Errata stay flagged.
Eras
V4-Flash flagship — 284B on a single 8 GB GPU.
Subject: DeepSeek-V4-Flash · MoE FP4 routed experts + FP8 backbone · ~284 B total parameters on disk · 13 B active per token
The flagship demonstration. DeepSeek-V4-Flash — a frontier-grade open MoE artefact the field assumed required several data-centre GPUs — was driven through end-to-end inference on a single RTX 3060 Ti, 8 GB VRAM, 13 GB system RAM, 80 GB swap, WSL2. This was not a stunt. It was the phase that produced every piece of infrastructure the rest of the lab now runs on: planck packs, the Singularity Monolith VRAM pool, the hot-expert cache, the singularity index, the C-extensions for packed-q8 decode. Calling V4-Flash an "autopsy archive" understates what it was. It was the source.
planck_core.so/.dll · planck_core_v3.so
Optimization passes — 19 measured improvements (and one honest negative result)
shared_experts → RAM streaming.
FP8_E8M0 → FP32 in scale expansion (kernel-level).
Reference correctness — Paris top-1 verified
"What is the capital of France? Answer in one word." (DeepSeek-V4 chat-template wrapped)
8 architecture findings that broke our own ports
<|Assistant|></think> is mandatory. Without it the model emits garbage; the closing </think> selects non-thinking mode.
tid2eid table, but weights still come from sqrt(softplus(x·Wᵀ)) + gather + normalise × 1.5. Uniform 1/top_k destroys magnitudes.
(y_raw, scale); inplace form returns y·scale. Confusing them = silent 1000× underflow in MoE.
wkv/wgate/ape into a-portion (previous chunk overlap) and b-portion (current). HCA (m=128) has no overlap.
[−10, +10]; gate capped at +10 only, no lower bound. The asymmetry is intentional.
wq_a → q_norm → wq_b, apply per-head RMS with no learned weight: q *= rsqrt(q.square().mean(-1) + eps). Skipping degrades head specialisation silently.
All eight findings, the full reference Python pipeline, and per-layer activation dump tooling are now open-source: surgery → DeepSeek V4-Flash open-source reference. compare_dumps.py is the test oracle for any port.
sources: folder/V4_FLASH_TECH_BRIEF.md · folder/PYTHON_PIPELINE_DOC.md · folder/flash_mvp.py · folder/dump_ref_v4.py · folder/compare_dumps.py · folder/v4_download.log · folder/docs/HISTORY_TREE.md §1
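Two of the findings above state their formulas explicitly, so they can be sketched in a few lines. The following is a minimal pure-Python illustration, not the lab's reference pipeline: the function names, the toy inputs, and the reading of "normalise × 1.5" as "divide by the sum of the selected scores, then multiply by 1.5" are all assumptions made for this example.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def gate_weights(logits, expert_ids, route_scale=1.5):
    """Weights for the selected experts of one token.

    logits      -- per-expert x·Wᵀ scores (toy values here)
    expert_ids  -- experts already chosen for the token (hypothetical input)
    route_scale -- the 1.5 factor quoted in the finding
    """
    scores = [math.sqrt(softplus(v)) for v in logits]   # sqrt(softplus(x·Wᵀ))
    picked = [scores[e] for e in expert_ids]            # gather
    total = sum(picked)
    return [route_scale * p / total for p in picked]    # normalise, then scale

def per_head_rms(q_heads, eps=1e-6):
    """q *= rsqrt(mean(q^2) + eps), applied per head, no learned weight."""
    out = []
    for head in q_heads:
        mean_sq = sum(v * v for v in head) / len(head)
        inv = 1.0 / math.sqrt(mean_sq + eps)
        out.append([v * inv for v in head])
    return out

logits = [2.0, -1.0, 0.5, 3.0]
expert_ids = [3, 0]                      # assumed selection for illustration
weights = gate_weights(logits, expert_ids)
uniform = [1.0 / len(expert_ids)] * len(expert_ids)

normed = per_head_rms([[3.0, 4.0], [0.1, -0.2]])
```

In this toy case `weights` sums to 1.5 and keeps the ordering of the underlying scores, whereas `uniform` sums to 1.0 and discards it, which is the magnitude loss the finding warns about.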
Physarum-05B-Organic — the first weight surgery.
Engine: folder/physarum_engine.cpp · 137 lines C++17 · energy-conserving softmax flow · organic threshold D < mean·0.1
The first real weight surgery in the lab. A 137-line C++17 engine performed organic, flow-based pruning on Qwen 2.5 0.5B, killing 20.6 % of weights without changing the file size on disk. We measured what survived and what did not — and we publish both, not only the survivors.
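The threshold rule quoted in the engine header, D < mean·0.1, can be illustrated with a short sketch. This is not the 137-line C++ engine: the source does not define D here, so the example assumes D is a per-weight flow value and uses weight magnitude as a stand-in, and it assumes killed weights are zeroed in place so that the tensor keeps its shape. The function name is hypothetical.

```python
def organic_prune(weights, flow, ratio=0.1):
    """Zero every weight whose flow value D is below ratio * mean(D).

    Returns the pruned list (same length as the input) and the kill rate.
    """
    threshold = ratio * (sum(flow) / len(flow))
    pruned = [0.0 if d < threshold else w for w, d in zip(weights, flow)]
    killed = sum(1 for d in flow if d < threshold)
    return pruned, killed / len(flow)

weights = [0.8, -0.02, 0.5, 0.01, -0.9]
flow = [abs(w) for w in weights]   # assumption: magnitude stands in for D
pruned, kill_rate = organic_prune(weights, flow)
```

Because pruned entries are overwritten with zeros rather than removed, the element count is unchanged, which is consistent with the unchanged on-disk size reported for this surgery.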
folder/Physarum-05B-Organic/model.safetensors · 988 MB (≈ donor + 200 B header)
Honest delta on hard tasks (measured · not estimated)
| Axis | Before | After | Δ | Note |
|---|---|---|---|---|
| Perplexity (raw) | 27.16 | 31.32 | +15.3 % | final_results.json measured value · brief had +12.5 %, see errata below |
| Throughput (tok/s) | 27.15 | 27.55 | no regression | preserved within noise |
| MMLU-mini | 90 % | 70 % | −22 % | hard-task degradation |
| GSM8K-mini | 100 % | 80 % | −20 % | hard-task degradation |
| JSON-repair smoke | 100 % | 100 % | 0 | no regression |
| Code-skeleton smoke | 100 % | 100 % | 0 | no regression |
| Disk size | 988 MB | 988 MB | +0 % | same shape, same shards |
| VRAM | baseline | +1 % | noise | — |
| decode_128 | baseline | +14 % | within noise | — |
Erratum: the brief reported a perplexity increase of +12.5 % on this surgery. The measured value in folder/final_results.json is +15.3 % (27.16 → 31.32). The +12.5 % figure does not appear in any source file and is withdrawn.
sources: folder/physarum_engine.cpp · folder/Physarum-05B-Organic/ · folder/final_results.json · folder/reports/TRUTH_LEDGER.md §B · weight_diff.json · sparsity_pattern.json
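The percentage deltas in the table are relative changes against the "Before" column, and they can be re-derived from the before/after values alone. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def rel_delta_pct(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# Before/after pairs copied from the table above.
rows = {
    "Perplexity (raw)": (27.16, 31.32),
    "MMLU-mini (%)": (90.0, 70.0),
    "GSM8K-mini (%)": (100.0, 80.0),
}

for name, (before, after) in rows.items():
    print(f"{name}: {rel_delta_pct(before, after):+.1f} %")
# Perplexity (raw): +15.3 %
# MMLU-mini (%): -22.2 %
# GSM8K-mini (%): -20.0 %
```

The perplexity row reproduces the +15.3 % in final_results.json; no pairing of the published values yields +12.5 %.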
GIGACHAD Phase 6 → Phase 13 — the native runtime arc.
The bulk of the work. The native runtime was consolidated, a 7B top-brain was operated on, the Singularity / planck pack format was implemented end-to-end, the speed arc carried us from 1.91 tok/s to over 83 tok/s, the chat path was verified, the Black-Dog reinforcement loop was wired, and BD-series organ surgery began. Every phase below maps to a report file; all sub-step numbers come from that report's measurements.
Phase 6 — Native consolidation · 2026-04-27
TRUTH_LEDGER.md first written.
Phase 7 — Physarium-7B top-brain surgery · 2026-04-27
PHYSARIUM_RESULTS_RECONCILE.md and PHYSARIUM_COVERAGE_AUDIT.md.
sources: reports/PHYSARIUM7B_SURGERY_REPORT.md · reports/PHYSARIUM_COVERAGE_AUDIT.md · reports/PHYSARIUM_RESULTS_RECONCILE.md
Phase 8A → 8D — Planck format & first E2E
PLANCK7B_PACK format · writer · reader · verifier. Byte-for-byte verify 50/50 PASS.
organ_manager + planck_runner wiring.
Phase 8E — The Speed Arc ✦
--backend cuda via organ_manager. json_repair 2.4 s vs 49.2 s CPU = 20×.
sources: reports/HYPERSPEED_8E5.md · reports/PHASE_8E8A_DP4A_NATIVE_BACKEND.md · reports/EXTERNAL_BACKEND_SHOOTOUT_V2.md · reports/EXTERNAL_SHOOTOUT_8E7.md · docs/HISTORY_TREE.md §5
Phase 8F — Decoder controls, calibration, identity gate · 2026-04-27 → 04-28
identity_version. Identity probe regression 6 questions, DOD ≥ 5/6.
gigachad_regression_native.
Phase 9 — Fusion, parallel, identity LoRA · 2026-04-28 → 04-29
--chat · Q4 resident wired to physarium_7b organ.
Phase 10 — Universal LLM surgery protocol · 2026-04-29
Universal-LLM surgery protocol document drafted — generalisation of the lab's gating doctrine across donor families.
Phase 11 — Acceptance run · 2026-04-29
--chat.
Phase 12 — NanoOS substrate · hologram cache · code-repair loop · 2026-04-29 → 04-30
chat_context_builder · scrolls → system_msg.
run_chat. 860 ms → 1 ms = 860× speedup on identical-prompt repeats.
sources: reports/EXACT_REPLAY_CACHE_V1.md · reports/HOLOGRAM_REPLAY_X100.md · reports/TERMINAL_NANOOS_30.md · docs/PHASE_12_NANO_OS_EXECUTION_SUBSTRATE.md
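The 860× figure applies to identical-prompt repeats, which is the behaviour of an exact-match replay cache: the second request for the same prompt skips generation entirely. A minimal sketch of that idea follows. The class name, the SHA-256 keying, and the stand-in model function are assumptions for illustration and are not taken from EXACT_REPLAY_CACHE_V1.md.

```python
import hashlib

class ExactReplayCache:
    """Return a stored reply when the prompt is byte-identical to an earlier one."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Hash the exact bytes, so any change to the prompt is a miss.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_run(self, prompt, run):
        """Return (reply, was_cache_hit); call run(prompt) only on a miss."""
        key = self._key(prompt)
        if key in self._store:
            return self._store[key], True
        reply = run(prompt)
        self._store[key] = reply
        return reply, False

calls = []

def slow_model(prompt):
    # Stand-in for the real generation path.
    calls.append(prompt)
    return prompt.upper()

cache = ExactReplayCache()
first, hit1 = cache.get_or_run("hello", slow_model)
second, hit2 = cache.get_or_run("hello", slow_model)
```

The trade-off of exact matching is that it never returns a reply for a prompt the model has not literally seen, at the cost of missing on any whitespace or wording change.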
Phase 13 — Black-Dog organ colony · 2026-04-30 → 2026-05-03
phys05_code_skeleton kept · 13/100 MBPP, 6/164 HE, anchor 19/19.
phys05_triz_contradiction · 88/100 strict 6-field JSON, fallback 0.
phys05_json_repair surgery · 10 / 10 GREEN on production failure catalog · first organ at 100 % on a real failure-bench · 280 synthetic rows, loss 0.055 → 0.0003 · comma_as_colon (the BD8 wound-v2 quirk) handled natively.
dag/capsules/cap_*.json Terminal-NanoOS task corpus · 9 epochs OR r=16 (one lever, not both) · prompt template tightening to suppress Human: donor-token leak.
sources: reports/BD6_POST_SURGERY_DELTA.md · reports/BD6_2_OVERTRAIN_DELTA.md · reports/BD6_8D_RANK_FINAL_FREEZE.md · reports/BD7_TRIZ_SURGERY_FINAL.md · reports/RUNTIME_ORGANISM_BENCH_V1.md · reports/BD9_JSON_REPAIR_FINAL.md · reports/BD9_FOUR_ORGANS_FINAL.md · docs/PHASE_13_BLACK_DOG_ORGAN_COLONY.md
CLEAN-ROOM doctrine — 2026-04-29 → 04-30
sources: reports/CLEAN_ROOM_DOCTRINE.md · reports/EXTERNAL_BACKEND_SHOOTOUT_V2.md
From 1.91 tok/s to 83.58 tok/s.
The headline arc: every step measured on the same RTX 3060 Ti. Each row is a milestone in the native runtime; each number has a report file. The rows up to the DP4A tg128 result are the climb; the llama.cpp row is the production ceiling, achieved by treating llama.cpp as a clean-room autopsy and porting its kernels into our own backend. The final row is the mean per-query wall time of the acceptance suite on that backend.
| Phase / config | Speed | Vs prev | Note |
|---|---|---|---|
| V4-Flash 284B PyTorch warm decode | p50 9.6 s/tok | — | flagship demo · 8 GB VRAM |
| Physarum-05B-Organic baseline | 27.15 tok/s | — | 0.5B BF16 baseline |
| CPU baseline · 0.5B | 1.91 tok/s | — | reference floor for CPU path |
| CUDA full GPU 0.5B (Phase 8E.1) | 116 tok/s | 61× CPU | byte-identical to CPU |
| CUDA fused 7B BF16 streaming (8E.2) | 0.20 tok/s | — | correctness proof, not main path |
| Q4 NUCLEAR resident 7B (8E2) | 11.16 tok/s | 280× CPU baseline | 5.55 GB Q4 group=128 · 28 layers in VRAM |
| Q4 native v2 default --chat | 18.27 tok/s | +64 % over NUCLEAR | — |
| Q4 native + DP4A=1 (opt-in) | 28.99 tok/s | +59 % | — |
| Q4 native + DP4A · tg128 | 41.69 tok/s | +44 % | — |
| llama.cpp backend (LLAMACPP_URL) | 83.58 tok/s | +100 % | production speed · clean-room autopsy |
| Mode C llama.cpp acceptance · mean wall | 2.99 s | — | per query, 18-task suite |
sources: reports/EXTERNAL_BACKEND_SHOOTOUT_V2.md · reports/PHASE_8E8A_DP4A_NATIVE_BACKEND.md · reports/CURRENT_TRUTH_LEDGER.md §2 · 5-run mean on RTX 3060 Ti, Physarium-7B Q4
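The "Vs prev" percentages for the 7B Q4 rows follow directly from consecutive tok/s values, and the 61× figure from the two 0.5B rows. A short sketch that recomputes them from the numbers in the table; the variable names are illustrative only.

```python
# (label, tok/s) pairs copied from the table, in order.
steps = [
    ("Q4 NUCLEAR resident 7B", 11.16),
    ("Q4 native v2 default --chat", 18.27),
    ("Q4 native + DP4A=1", 28.99),
    ("Q4 native + DP4A tg128", 41.69),
    ("llama.cpp backend", 83.58),
]

# Percentage gain of each row over the row before it.
gains = [
    (name, round((cur / prev - 1.0) * 100))
    for (_, prev), (name, cur) in zip(steps, steps[1:])
]

# CUDA full-GPU 0.5B versus the 0.5B CPU floor.
cuda_vs_cpu = round(116 / 1.91)

for name, pct in gains:
    print(f"{name}: +{pct} %")
# Q4 native v2 default --chat: +64 %
# Q4 native + DP4A=1: +59 %
# Q4 native + DP4A tg128: +44 %
# llama.cpp backend: +100 %
```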
Nine runs in a row, no regression.
The 18-task curated acceptance suite is the integrity gate every change has to pass. Identity probe, organ-leaks gate, structural verifier, end-to-end JSON / code / claim / memory routes. No surgery merges if this regresses. Below is the ladder we have shipped.
| Run | Result | Identity | Leaks | Wall | Note |
|---|---|---|---|---|---|
| v14 · llama.cpp backend | 18/18 | 14/14 | 0 | 2.99 s | production ceiling · CURRENT_TRUTH_LEDGER §1 |
| v15 · DP4A native | 17/18 | — | 0 | — | opt-in flag · close to 18/18 with G2.b/G3/G4 fixes |
| v16 · after Gap C close | 18/18 | — | 0 | — | Gap C kill · GAP_C_KILL_FINAL.md |
| v17 · llamacpp after Gap C | 18/18 | — | 0 | — | — |
| v18 · after G3 (Python compile probe) | 18/18 | — | 0 | — | verifier hardening · runtime Python compile probe |
| v19 · after holographic form replay | 18/18 | — | 0 | — | HOLOGRAPHIC_FORM_REPLAY_V1.md |
| v20 · after native CR | 18/18 | — | 0 | — | code-repair loop in native runtime |
| v21 · anchored preamble | 18/18 | — | 0 | — | identity anchor in system message |
| v22 · post anchor | 18/18 | — | 0 | — | stable post-9F LoRA anchor |
Run v14 is recorded in CURRENT_TRUTH_LEDGER.md §1. Runs v15 → v22 are listed in the master HISTORY_TREE.md trajectory. Some intermediate JSON snapshots are in reports/; the per-run JSON files for v16–v22 are recorded in master logs but not all preserved as standalone files. Treat the ladder as "every shipped change passed acceptance, recorded contemporaneously": the headline is the trajectory, not any individual JSON.
sources: reports/gigachad_acceptance_run_v14_llamacpp.json · reports/CURRENT_TRUTH_LEDGER.md §1 · docs/HISTORY_TREE.md §5 / §6
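The merge rule described above (no regression on the 18-task suite, zero leaks) can be expressed as a simple predicate over the ladder rows. This is a sketch under assumptions: the dataclass, the field names, and the strict "all tasks pass" reading are hypothetical, and the source does not say how the rule is applied to opt-in paths such as v15.

```python
from dataclasses import dataclass

@dataclass
class AcceptanceRun:
    name: str
    passed: int
    total: int
    leaks: int

def gate_ok(run: AcceptanceRun) -> bool:
    """Strict reading of the gate: every task passes and no organ leaks."""
    return run.passed == run.total and run.leaks == 0

# First three rows of the ladder, copied from the table.
ladder = [
    AcceptanceRun("v14", 18, 18, 0),
    AcceptanceRun("v15", 17, 18, 0),
    AcceptanceRun("v16", 18, 18, 0),
]

blocked = [run.name for run in ladder if not gate_ok(run)]
```

Under this strict reading only v15 (17/18, the opt-in DP4A flag) would be held back, which matches its note that it was close to 18/18 pending fixes.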
The doctrine, in date order.
The slogans on the program pages — "no GREEN without numbers", "reverts recorded in full", "external systems are autopsy specimens" — are not branding. Each one has a doctrine document, written on a specific day, for a specific reason. Below is the chronology.
| Document | First written | Established that |
|---|---|---|
| TRUTH_LEDGER.md | 2026-04-27 | A/B/C/D categorisation of every claim — measured · surgery · scaffold · unsafe |
| ARCHITECTURE_LOCK.md | ~2026-04-27 | Donor used as DONOR ONLY — no in-runtime cross-talk |
| PHYSARIUM_RESULTS_RECONCILE.md | 2026-04-27 | v1 errata · denominator audit · v1 magnitude-flow vs activation-aware distinction |
| PHYSARIUM_COVERAGE_AUDIT.md | 2026-04-27 | Tile coverage 100 % · kill-rate denominator framework |
| GIGACHAD_LAB_MASTER_REPORT.md | 2026-04-27 | Master single source · every new report appends "UPDATE TO MASTER REPORT" |
| ARIZ_KERNEL.md | — | ARIZ / TRIZ reasoning kernel spec |
| BLACK_DOG_LEARNING_LOOP.md | — | Food / poison reinforcement loop spec |
| CLEAN_ROOM_DOCTRINE.md | 2026-04-29 | External systems = patients, never spine. llama.cpp / EXL2 / AWQ / Claude Code — autopsy only |
| CURRENT_TRUTH_LEDGER.md | 2026-04-29 | Most recent SoT replaces TRUTH_LEDGER · live updates land here first |
| PHASE_12_NANO_OS_EXECUTION_SUBSTRATE.md | 2026-04-29 | NanoOS spec · capsule sandbox for terminal evaluation |
| PHASE_13_BLACK_DOG_ORGAN_COLONY.md | 2026-04-30 | Organ colony spec · BD-series surgery model |
| HISTORY_TREE.md | 2026-04-30 | Single source of "what happened when" · the chronological backbone |
| X100_SCOREBOARD.md | — | Repeat-learning ×100 maintenance ledger |
all documents under folder/docs/ and folder/reports/ · primary chronological backbone: folder/docs/HISTORY_TREE.md
The first independent scorecard.
2026-05-09 → 2026-05-18 · 9 days · 3 jumps · 1 hybrid milestone · 1 loop closed
The Surgery / Frankenstellm chronicle (Eras 1–4) shipped a runtime; the doctrine era (Era 6) wrote the rules. Era 7 is the first independently scorecarded cognitive milestone — ADAM, the long-horizon program, climbing the public ARC-AGI-3 leaderboard from a cold entry at 22 / 183 to #1 published on both tracks in nine days. The signal is not the percentage. The signal is the closed loop: experience changed memory, memory changed procedure selection, procedure selection improved the next run.
| Date | Score | Method delta | Why it moved |
|---|---|---|---|
| 2026-05-09 | 22 / 183 · 12.02 % | Substrate + scoring loop only | First leaderboard entry. Pure self-play, no policy training, no source reading. |
| 2026-05-14 | 24 / 183 · 13.11 % | + Rudakov-style graph explorer | Richer trajectory expansion under the same scoring loop. Same day: hybrid harness reached 183 / 183 = 100.00 % (25/25 envs, 6 537 actions) and was scorecarded. |
| 2026-05-18 | 25 / 183 · 13.66 % | + warmed world_model · + substrate-explore-fallback | Persistent substrate learning across runs. World_model carries causal_bias and delta_mv from prior attempts; fallback hands frontier control to lg_quantum_think when beam saturates. |
Independent verification at arcprize.org/scorecards/6a5888ac-21e1-40b9-abac-5fecbe62cb42 — we do not control this URL. Compute: single consumer NVIDIA RTX 3060 Ti, 8 GB VRAM. No data centre, no cloud, no external API. The hybrid 183 / 183 included two human boss-level demonstrations for the hardest 2 levels (bp35 L8, wa30 L8), disclosed inside the scorecard. The autonomous 25 / 183 is the LLM-free, human-free figure.
primary references: folder/docs/ARC_AGI3_PHASE_162.md · folder/reports/ARC_AGI3_SUBSTRATE_CLIMB.md · public pages: /adam ARC section, /arc-agi-3 leaderboard
This page is the record.
The two program pages — SURGERY and FRANKENSTELLM — describe the lab and the organism as they exist today. This page describes the trajectory that produced them, day by day, with reverts, errata, and dead-ends still visible. Numbers without a date are slogans; numbers with a date are evidence.