OFFICIAL_BENCH_STACK — runner status (2026-05-02)
Goal: drive the FrankenLLM through public benchmarks where the numbers mean something to the outside world. Replace the toy MBPP/HE in-house splits with proper held-out subsets.
Scope
| benchmark | subset planned | mode A (7B only) | mode B (organ only) | mode C (organ + 7B) | runner status |
|---|---|---|---|---|---|
| MBPP | 100 official | done (history) | done (BD6 = 13/100) | done | wired |
| HumanEval | 164 official | done (history) | done (BD6 = 6/164) | done | wired |
| LiveCodeBench | 50 (subset v1) | done | done (0/50) | done | wired (LCB_CODE_ROUTE_FIX) |
| BFCL | 50 hard / official | partial | partial | partial | runner exists, not at scale |
| GPQA Diamond | 50 subset | not run | not applicable | not run | TODO runner |
| SWE-bench Lite | 20 subset | not run | not applicable | not run | TODO runner |
| Terminal-Bench | 30 subset (NanoOS) | done | n/a | done (V1 / V2) | wired |
What's blocking each "TODO runner"
- GPQA Diamond: needs a question loader and a multiple-choice harness in tools/bench/. The evaluator is exact-match on the answer letter, so it is easy. Estimate: ~2 hours.
- SWE-bench Lite: needs patch-apply, repo-clone, and a test runner per task. Heavy. Probably gated on the Phase-12 NanoOS shell capsule plus a dedicated runner. Estimate: ~1-2 days.
- BFCL hard: the runner exists; it still needs official subset selection and a 3-way mode harness (similar in shape to the MBPP/HE 3-mode runner).
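The exact-match-on-letter evaluator mentioned for GPQA Diamond could look roughly like the sketch below. It is not the planned harness: the function names `extract_letter` and `score`, the four-option A-D assumption, and the answer-extraction heuristic are all hypothetical choices made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical heuristic: accept "answer is X", "Answer: X" or "answer (X)".
# The letter must be uppercase A-D and must not start a longer word.
ANSWER_RE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_letter(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if no letter can be found."""
    text = response.strip()
    # A bare single-letter reply is taken as the answer directly.
    if len(text) == 1 and text.upper() in "ABCD":
        return text.upper()
    matches = ANSWER_RE.findall(text)
    # If the model states several answers, keep the last one.
    return matches[-1] if matches else None


def score(responses: List[str], gold: List[str]) -> Tuple[int, int]:
    """Exact-match scoring: returns (number correct, number of questions)."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    correct = sum(extract_letter(r) == g for r, g in zip(responses, gold))
    return correct, len(gold)
```

An unparseable response counts as wrong rather than being skipped, which keeps the denominator fixed at the subset size.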
What WILL be reported (one consolidated table)
When TRACK 6 fires:
| benchmark | subset | A | B | C | C-A | wall | organs_used | BD_signal | fb_count |
|---|---|---|---|---|---|---|---|---|---|
| MBPP | 100 | ?/? | 13 | ?/? | … | … | phys05_code… | food/poison | … |
| HumanEval | 164 | ?/? | 6 | ?/? | … | … | phys05_code… | food/poison | … |
| LiveCodeBench | 50 | ?/? | 0 | ?/? | … | … | phys05_code… | food/poison | … |
| BFCL hard | 50 | ?/? | … | ?/? | … | … | phys05_* | food/poison | … |
| GPQA Diamond | 50 | ?/? | n/a | ?/? | … | … | physarium_7b | - | … |
| SWE-Lite | 20 | ?/? | n/a | ?/? | … | … | multi-organ | food/poison | … |
| Terminal-Bench | 30 | ?/? | n/a | ?/? | … | … | shell capsule | food/poison | … |
The A and B columns show single-shot accuracy in each mode; C is the organ + 7B result. C-A is the FrankenLLM lift over the single 7B model. The wall, organs_used, BD_signal, and fb_count columns come from DAG entries.
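As a worked illustration of the C-A column, the sketch below computes the lift in percentage points from pass counts. The `ModeResult` type, the `lift_points` name, and the choice of percentage points (rather than a raw pass-count difference) are assumptions, and the counts are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeResult:
    """Pass count for one mode on one benchmark subset (hypothetical type)."""
    passed: int
    total: int


def lift_points(c: ModeResult, a: ModeResult) -> float:
    """C-A lift in percentage points; both modes must use the same subset."""
    if c.total != a.total or a.total <= 0:
        raise ValueError("modes must share the same non-empty subset")
    return 100.0 * (c.passed - a.passed) / a.total


# Placeholder counts for illustration only, NOT benchmark results.
mode_a = ModeResult(passed=40, total=100)
mode_c = ModeResult(passed=45, total=100)
print(f"C-A = {lift_points(mode_c, mode_a):+.1f} pp")
```

Refusing to compare modes with different totals guards against mixing a partial run with a full one in the same row.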
Why this isn't run YET
Strict order of operations from the master roadmap:
- TRACK 1 must produce a real BD7 TRIZ pack (in flight).
- TRACK 2 must enable conductance routing (so that C-mode actually exercises the BD signal).
- TRACK 4 must have critic+wound active (so that C-mode tests the repair loop, not just plain organ + 7B).
Running the official benchmarks before those tracks land would only re-prove the BD6 numbers (MBPP 13/100, HumanEval 6/164, LiveCodeBench 0/50). After them, B and C will both move.
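The ordering constraint above amounts to a simple precondition gate, sketched below. The `PREREQS` mapping and both function names are hypothetical; nothing here reflects how the roadmap is actually tracked.

```python
from typing import Iterable, List

# Prerequisites for the official bench run, as listed above.
# The dict and function names are illustrative assumptions.
PREREQS = {
    "TRACK 1": "real BD7 TRIZ pack produced",
    "TRACK 2": "conductance routing enabled",
    "TRACK 4": "critic+wound active",
}


def blocking_tracks(done: Iterable[str]) -> List[str]:
    """Return the prerequisite tracks that are not finished yet, in order."""
    finished = set(done)
    return [track for track in PREREQS if track not in finished]


def can_fire_track6(done: Iterable[str]) -> bool:
    """TRACK 6 may fire only when no prerequisite track is still blocking."""
    return not blocking_tracks(done)


if __name__ == "__main__":
    for track in blocking_tracks({"TRACK 2"}):
        print(f"blocked by {track}: {PREREQS[track]}")
```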
Honest current numbers (do not display as final)
MBPP 100 A ~? / B 13 / C ?
HumanEval 164 A ~? / B 6 / C ?
LiveCodeBench 50 A ~? / B 0 / C ?
Anchor (BD6) 19/19 (organ alone, current decoder)
The re-run for the consolidated table happens when TRACK 6 fires.
File pointers
- tools/bench/mbpp_he_3mode.py: MBPP/HE A/B/C runner (live)
- tools/bench/livecodebench_3mode.py: LCB A/B/C runner (live)
- tools/bench/triz_organ_bench.py: BD7 TRIZ runner (live)
- tools/bench/: directory for additions (gpqa, swe-lite, bfcl runners pending)
- reports/MBPP_HE_3MODE_V1.{md,json}: last live MBPP/HE result
- reports/LIVECODEBENCH_3MODE_V1.{md,json}: last live LCB result
- reports/BD7_TRIZ_BASELINE_T2_N100.json: TRIZ pre-surgery baseline (0/100)