OFFICIAL_BENCH_STACK — runner status (2026-05-02)

Goal: drive the FrankenLLM through public benchmarks where the numbers mean something to the outside world. Replace the toy MBPP/HE in-house splits with proper held-out subsets.

Scope

| benchmark        | subset planned          | mode A (7B only) | mode B (organ only) | mode C (organ + 7B) | runner status               |
|------------------|-------------------------|------------------|---------------------|---------------------|-----------------------------|
| MBPP             | 100 official            | done (history)   | done (BD6 = 13/100) | done                | wired                       |
| HumanEval        | 164 official            | done (history)   | done (BD6 = 6/164)  | done                | wired                       |
| LiveCodeBench    | 50 (subset v1)          | done             | done (0/50)         | done                | wired (LCB_CODE_ROUTE_FIX)  |
| BFCL             | 50 hard / official      | partial          | partial             | partial             | runner exists, not at scale |
| GPQA Diamond     | 50 subset               | not run          | not applicable      | not run             | TODO runner                 |
| SWE-bench Lite   | 20 subset               | not run          | not applicable      | not run             | TODO runner                 |
| Terminal-Bench   | 30 subset (NanoOS)      | done             | n/a                 | done (V1 / V2)      | wired                       |

What's blocking each "TODO runner"

GPQA Diamond: need question-loader + multi-choice harness in tools/bench/. Evaluator is exact-match on letter, easy. ~2 hours.

SWE-bench Lite: need patch-apply + repo-clone + test runner per task. Heavy. Probably gated on Phase-12 NanoOS shell capsule plus dedicated runner. ~1-2 days.

BFCL hard: runner exists; needs official subset selection + 3-way mode harness (similar shape to MBPP/HE 3-mode runner).
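The GPQA Diamond evaluator is described above as exact match on the answer letter. A minimal sketch of what that could look like follows. It assumes the harness prompts the model to finish with a line of the form `Answer: X`; the function names `extract_letter` and `score` are hypothetical and are not part of the existing tools/bench/ code.

```python
import re
from typing import List, Optional, Tuple

# Assumption: responses end with "Answer: X" where X is one of A-D.
# "answer" is matched case-insensitively; the letter itself must be uppercase
# so that the English article "a" is never mistaken for option A.
_ANSWER_RE = re.compile(r"(?i:answer)\s*:\s*\(?([A-D])\b")


def extract_letter(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter found in the response, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def score(responses: List[str], gold: List[str]) -> Tuple[int, int]:
    """Exact-match scoring: returns (number correct, number of questions)."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    # A response with no parseable letter yields None and counts as wrong.
    correct = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return correct, len(gold)
```

Taking the last match rather than the first lets a response reason through several options before committing; unparseable responses are scored as incorrect rather than skipped, so the denominator stays at the subset size.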

What WILL be reported (one consolidated table)

When TRACK 6 fires:

benchmark      subset    A     B     C    C-A   wall   organs_used     BD_signal   fb_count
MBPP           100      ?/?   13   ?/?   …     …      phys05_code…   food/poison   …
HumanEval      164      ?/?    6   ?/?   …     …      phys05_code…   food/poison   …
LiveCodeBench   50      ?/?    0   ?/?   …     …      phys05_code…   food/poison   …
BFCL hard       50      ?/?   …    ?/?   …     …      phys05_*        food/poison   …
GPQA Diamond    50      ?/?   n/a  ?/?   …     …      physarium_7b    —             …
SWE-Lite        20      ?/?   n/a  ?/?   …     …      multi-organ     food/poison   …
Terminal-Bench  30      ?/?   n/a  ?/?   …     …      shell capsule   food/poison   …

The A and B columns show single-shot accuracy in each mode. C-A is the lift of the FrankenLLM (mode C) over the 7B alone (mode A). wall, organs_used, BD_signal, and fb_count come from DAG entries.
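As a rough illustration of the C-A column, the sketch below computes the lift from per-mode pass counts. It assumes the lift is reported in percentage points, which this document does not specify; `ModeResult` and `lift` are hypothetical names, and the numbers in the usage note are placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModeResult:
    """Pass count for one mode (A, B, or C) on one benchmark subset."""
    passed: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.passed / self.total


def lift(a: Optional[ModeResult], c: Optional[ModeResult]) -> Optional[float]:
    """C-A in percentage points; None when either mode has not been run."""
    if a is None or c is None:
        return None
    return round(100 * (c.accuracy - a.accuracy), 1)
```

For example, with placeholder values `lift(ModeResult(10, 50), ModeResult(15, 50))` gives a lift of 10.0 points. Returning `None` for a missing mode keeps "not run" distinct from a lift of zero.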

Why this isn't run YET

Strict order of operations from the master roadmap:

TRACK 1 must produce real BD7 TRIZ pack (in flight).
TRACK 2 must enable conductance routing (so C-mode actually exercises BD signal).
TRACK 4 must have critic+wound active (so C-mode tests the repair loop, not just plain organ + 7B).

Running the official bench BEFORE those tracks would only re-prove the BD6 mode-B numbers (MBPP 13/100, HumanEval 6/164, LiveCodeBench 0/50). After them, B and C will both move.
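The ordering constraint above could be enforced with a small pre-flight check before TRACK 6 is allowed to fire. This is only a sketch: the `PREREQS` mapping restates the three conditions listed in this section, while the `blockers` function and the idea of passing in a set of completed tracks are assumptions, not an existing mechanism in the roadmap tooling.

```python
from typing import List, Set

# Prerequisites restated from the roadmap order in this document.
PREREQS = {
    "TRACK 1": "real BD7 TRIZ pack produced",
    "TRACK 2": "conductance routing enabled",
    "TRACK 4": "critic+wound active",
}


def blockers(done: Set[str]) -> List[str]:
    """Return the unmet prerequisites; an empty list means TRACK 6 may fire."""
    return [f"{track}: {why}" for track, why in PREREQS.items() if track not in done]
```

Calling `blockers({"TRACK 1"})` would list TRACK 2 and TRACK 4 as still outstanding, and an empty result signals that the official bench run is no longer just a repeat of BD6.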

Honest current numbers (do not display as final)

MBPP 100         A  ~? / B  13 / C  ?
HumanEval 164    A  ~? / B   6 / C  ?
LiveCodeBench 50 A  ~? / B   0 / C  ?
Anchor (BD6)            19/19 (organ alone, current decoder)

The re-run for the consolidated table happens when TRACK 6 fires.

File pointers

tools/bench/mbpp_he_3mode.py — MBPP/HE A/B/C runner (live)
tools/bench/livecodebench_3mode.py — LCB A/B/C runner (live)
tools/bench/triz_organ_bench.py — BD7 TRIZ runner (live)
tools/bench/ — directory for additions (gpqa, swe-lite, bfcl runners pending)
reports/MBPP_HE_3MODE_V1.{md,json} — last live MBPP/HE result
reports/LIVECODEBENCH_3MODE_V1.{md,json} — last live LCB result
reports/BD7_TRIZ_BASELINE_T2_N100.json — TRIZ pre-surgery baseline (0/100)