CyberdyneLabs · Reports · BENCH_CLEANUP_AND_OFFICIAL_RUN

BENCH_CLEANUP_AND_OFFICIAL_RUN — final truth table (2026-05-01)

reports/BENCH_CLEANUP_AND_OFFICIAL_RUN.md 1688 words raw markdown ↗

BENCH_CLEANUP_AND_OFFICIAL_RUN — final truth table (2026-05-01)

Single source of truth for the four-task project. All TASK 1+2 numbers below are real, full-N runs; TASK 3 status per-bench in the table.


TASK 1 — fix json_repair classifier quirk  ✅ DONE

DOD: MONSTER_INTEGRATION_V1 route=8/8, DAG/BD on active non-cache routes=6/6, json_repair uses phys05_json_repair, no fallthrough json → ariz_e2e, every active route has organs_used + BD signal.

Result (reports/MONSTER_INTEGRATION_V1.md):


TASK 2 — real public coding benches × 3 modes  ✅ DONE

Modes:

Bench: tools/bench/mbpp_he_3mode.py, single --chat call per prompt, all runtime-side cleverness, no Python-side retry.

Full N results

| bench | n | A pass | B pass | C pass | C−A | C−B | A wall | B wall | C wall | B organs_used | A/C fb_count | B BD_signal | |-------------|-----|--------|--------|--------|-----|-----|-----------|----------|-----------|-----------------------|--------------|-------------| | MBPP | 100 | 60 | 6 | 60 | 0 | +54 | 5362 s | 550 s| 5353 s | phys05_code_skeleton| 99 / 99 | 57 | | HumanEval | 164 | 81 | 2 | 81 | 0 | +79 | 8626 s | 1083 s| 8629 s | phys05_code_skeleton| 164 / 164 | 104 |

(fb_count = rows that touched physarium_7b* per the harness's substring match. B BD_signal = rows whose envelope carried a dag link with food/poison/conductance written.)

What this measurement reveals (the answer to user's TASK 5 constraints)

  1. NO_7B_FALLBACK=1 works end-to-end. Mode B's organs_used_set = {phys05_code_skeleton} only — zero physarium_7b* leaked across 264 code prompts. The new gates in run_chat_organ_route, run_chat_json_repair, run_chat_ariz_organ_first, the unsupported-route branch, and the run_ariz_e2e fall-through all correctly emit "no 7B" envelopes when the gate is set.
  1. Raw 0.5B competence floor. On real public-test benchmarks the 0.5B organ alone scores 6 % MBPP, 1.2 % HumanEval. The organ chain emits syntactically-broken code on ~80 % of HumanEval docstring-completion prompts. 0.5B cannot solve coding problems alone. That's the honest answer: B reveals organs are not useful in isolation for code generation at this model size.
  1. A and C are identical. 60/60 on MBPP, 81/81 on HumanEval — at exactly the same wall time (5362 vs 5353 s; 8626 vs 8629 s). The 7B fallback dominates pass-rate; the organ contributes nothing measurable when 7B is reachable. At this 0.5B model quality, the C-A delta is 0. Surgery (BD3-poison-driven QLoRA → BD6) is the only path to a non-zero C-A delta.
  1. Wall-time advantage of the architecture. Mode B's wall is ~10× faster than A/C (550 s vs 5362 s on MBPP; 1083 s vs 8626 s on HumanEval). When the 0.5B can solve a task (~6/100 MBPP, ~2/164 HumanEval — mostly the trivial cases), it does so in 4–6 s vs 25–30 s for the 7B path. So the production C path is "use 7B for everything, get exactly A's accuracy at exactly A's wall." That's the reality.
  1. BD signal is real and per-task in B mode. 57 MBPP and 104 HumanEval rows in Mode B emit a DAG entry with food / poison / conductance_before / conductance_after. The poison rows — e.g. all 158 MBPP failures and 162 HumanEval failures in B — populate the BD3 poison dataset for the next surgery pass. This is the pipeline the user designed: organs fail honestly → poison feeds organ_qlora_surgery → next-pass organ rises.

Honest caveats


TASK 3 — 3 official benches  ⏳ ONE RUNNING, TWO BLOCKED-BLOCKER-EXACT

| bench | dataset access | harness | status | |---------------------------|-------------------------------------------------------|-----------------------------------------------------|---------------| | LiveCodeBench | ✅ livecodebench/code_generation (HF, public) | tools/bench/livecodebench_3mode.py (✅ done) | DONE — A/B/C all 0/50 on difficulty=easy | | BFCL official hard | ❌ gorilla-llm/BFCL is not an HF dataset; needs bfcl-eval Python package + a local OpenAI-shape adapter for --chat | not written | BLOCKED — DEPENDENCY (pip install bfcl-eval + ~50 lines FastAPI shim around ./build/gigachad_native --chat) | | GPQA Diamond | ❌ Idavidrein/gpqa gated on HF | not written (≈40-line MCQ harness once data loads) | BLOCKED — AUTH (huggingface-cli login + accept gated terms in browser, then HF_TOKEN=…) |

The exact errors are documented in reports/TASK3_OFFICIAL_BENCHES_STATUS.md. No hand-made substitutes — TASK 5 explicitly bans those (the existing tools/bench/bfcl_subset.py is a 10-question hand-made smoke and is not counted toward TASK 3).


TASK 4 — single unified table

| benchmark | subset | A score | B score | C score | C−A | C−B | A wall | B wall | C wall | B organs_used | A/C fb | B BD | notes | |------------------------|--------------|---------|---------|---------|-----|------|--------|--------|--------|-----------------------|--------|------|-------| | MBPP | n=100 | 60/100 | 6/100 | 60/100 | 0 | +54 | 5362 s | 550 s | 5353 s | phys05_code_skeleton | 99/99 | 57 | A=C identical; raw 0.5B floor 6% | | HumanEval | n=164 (full) | 81/164 | 2/164 | 81/164 | 0 | +79 | 8626 s | 1083 s| 8629 s | phys05_code_skeleton | 164/164 | 104 | A=C identical; 0.5B ~0 % on docstring shape | | LiveCodeBench | easy n=50 | 0/50 | 0/50 | 0/50 | 0 | 0 | 3199 s | 7 s | 1180 s | phys05_triz_contradiction (B); phys05_triz/claim+7b (C) | 50/49 | 48 (C) | model can't produce valid stdin/stdout competitive programs at this size | | BFCL official hard | — | — | — | — | — | — | — | — | — | — | — | — | BLOCKED — dependency bfcl-eval | | GPQA Diamond | — | — | — | — | — | — | — | — | — | — | — | — | BLOCKED — auth (gated dataset) |

JSON: reports/bench_cleanup_and_official_run.json mirrors this table exactly.

Reading the table per TASK 5 constraints


What's GREEN, what's YELLOW, what's RED

| component | status | |--------------------------------------------------|--------| | TASK 1 json_repair fall-through fix | GREEN | | MONSTER_INTEGRATION_V1 unified bench (8/8 + 6/6) | GREEN | | NO_7B_FALLBACK gate (organ_route, json_repair, ariz_organ_first, run_ariz_e2e fall-through, unsupported route) | GREEN | | 3-mode harness mbpp_he_3mode.py | GREEN | | MBPP n=100 × 3 modes | GREEN data (full N) | | HumanEval n=164 × 3 modes | GREEN data (full N) | | LiveCodeBench easy n=50 × 3 modes | GREEN data (all modes 0/50; model too small for atcoder problems) | | BFCL official hard subset | YELLOW — BLOCKED on bfcl-eval install + adapter | | GPQA Diamond | YELLOW — BLOCKED on HF auth | | Single unified table | GREEN (this file) |

The architectural finding from full-N data is unambiguous: at the current 0.5B model quality, organs add wall-time speedup but no accuracy on real code benches. The pipeline that makes the C-A delta move is organ surgery (BD3 poison → BD6 QLoRA), not more runtime plumbing. That is the next-phase scope.


Files touched