# OFFICIAL_BENCH_STACK — runner status (2026-05-02)

Goal: drive the FrankenLLM through public benchmarks where the numbers
mean something to the outside world. The plan is to replace the toy
in-house MBPP/HumanEval (HE) splits with proper held-out subsets.

## Scope

| benchmark        | subset planned          | mode A (7B only) | mode B (organ only) | mode C (organ + 7B) | runner status |
|------------------|-------------------------|------------------|---------------------|---------------------|---------------|
| MBPP             | 100 official            | done (history)   | done (BD6 = 13/100) | done                | wired |
| HumanEval        | 164 official            | done (history)   | done (BD6 = 6/164)  | done                | wired |
| LiveCodeBench    | 50 (subset v1)          | done             | done (0/50)         | done                | wired (LCB_CODE_ROUTE_FIX) |
| BFCL             | 50 hard / official      | partial          | partial             | partial             | runner exists, not at scale |
| GPQA Diamond     | 50 subset               | not run          | n/a                 | not run             | TODO runner |
| SWE-bench Lite   | 20 subset               | not run          | n/a                 | not run             | TODO runner |
| Terminal-Bench   | 30 subset (NanoOS)      | done             | n/a                 | done (V1 / V2)      | wired |

## What's blocking each "TODO runner" (and BFCL at scale)

* **GPQA Diamond** — need question-loader + multi-choice harness in
  `tools/bench/`. Evaluator is exact-match on letter, easy. ~2 hours.
* **SWE-bench Lite** — need patch-apply + repo-clone + test runner per
  task. Heavy. Probably gated on Phase-12 NanoOS shell capsule plus
  dedicated runner. ~1-2 days.
* **BFCL hard** — runner exists; needs official subset selection +
  3-way mode harness (similar shape to MBPP/HE 3-mode runner).
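
The GPQA evaluator described above (exact match on the answer letter) could look roughly like the sketch below. It is illustrative only: `extract_letter` and `score` are hypothetical names, nothing like this exists in `tools/bench/` yet, and it assumes the prompt asks the model to finish with a line of the form `Answer: X`.

```python
import re
from typing import List, Optional, Tuple

# Assumption: the prompt instructs the model to end with "Answer: <letter>".
# Matches "Answer: B" as well as "Answer: (B)".
_ANSWER = re.compile(r"Answer:\s*\(?([A-D])\)?")


def extract_letter(completion: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the completion, or None."""
    matches = _ANSWER.findall(completion)
    return matches[-1] if matches else None


def score(completions: List[str], gold: List[str]) -> Tuple[int, int]:
    """Exact-match scoring: (number correct, number of questions)."""
    if len(completions) != len(gold):
        raise ValueError("completions and gold must have the same length")
    correct = sum(extract_letter(c) == g for c, g in zip(completions, gold))
    return correct, len(gold)
```

A completion with no parseable answer counts as wrong rather than being dropped, so the denominator always equals the subset size.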

## What WILL be reported (one consolidated table)

When TRACK 6 fires:

```
benchmark      subset    A     B     C    C-A   wall   organs_used     BD_signal   fb_count
MBPP           100      ?/?   13   ?/?   …     …      phys05_code…   food/poison   …
HumanEval      164      ?/?    6   ?/?   …     …      phys05_code…   food/poison   …
LiveCodeBench   50      ?/?    0   ?/?   …     …      phys05_code…   food/poison   …
BFCL hard       50      ?/?   …    ?/?   …     …      phys05_*        food/poison   …
GPQA Diamond    50      ?/?   n/a  ?/?   …     …      physarium_7b    —             …
SWE-Lite        20      ?/?   n/a  ?/?   …     …      multi-organ     food/poison   …
Terminal-Bench  30      ?/?   n/a  ?/?   …     …      shell capsule   food/poison   …
```

The A and B columns show single-shot accuracy in each mode; C is the organ + 7B mode from the Scope table.
C-A is the FrankenLLM lift: mode C minus mode A (the 7B alone).
The wall, organs_used, BD_signal, and fb_count columns come from DAG entries.
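
As a rough sketch of how the A, B, C, and C-A cells could be derived from per-mode pass counts: the `ModeResult` type and `format_row` helper are hypothetical, the lift is assumed to be reported in percentage points, and the numbers in the usage note are placeholders, not results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModeResult:
    passed: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.passed / self.total


def lift(a: ModeResult, c: ModeResult) -> float:
    """C-A lift in percentage points (assumed unit)."""
    return 100.0 * (c.accuracy - a.accuracy)


def cell(result: Optional[ModeResult]) -> str:
    """Render one mode cell; None means the mode does not apply."""
    return "n/a" if result is None else f"{result.passed}/{result.total}"


def format_row(name: str, a: ModeResult, b: Optional[ModeResult], c: ModeResult) -> str:
    """One row: benchmark, A, B, C, and the signed C-A lift."""
    return f"{name:<15}{cell(a):>8}{cell(b):>8}{cell(c):>8}{lift(a, c):>+8.1f}"
```

For example, `format_row("demo", ModeResult(10, 50), None, ModeResult(15, 50))` renders a row with `n/a` in the B column and a signed lift in the last column.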

## Why this isn't run YET

Strict order of operations from the master roadmap:
1. TRACK 1 must produce real BD7 TRIZ pack (in flight).
2. TRACK 2 must enable conductance routing (so C-mode actually
   exercises BD signal).
3. TRACK 4 must have critic+wound active (so C-mode tests the repair
   loop, not just plain organ + 7B).

Running the official benchmarks BEFORE those tracks land would only
re-prove the BD6 numbers (MBPP 13/100, HumanEval 6/164, LiveCodeBench
0/50). After them, B and C are both expected to move.
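
The ordering above could be enforced with a small pre-flight check before TRACK 6 fires. This is a minimal sketch: the `PREREQUISITES` table and function names are hypothetical, and how a track is marked as done is left to the caller.

```python
from typing import List, Set

# Prerequisite tracks and what each must deliver, per the list above.
PREREQUISITES = {
    "TRACK 1": "real BD7 TRIZ pack produced",
    "TRACK 2": "conductance routing enabled",
    "TRACK 4": "critic+wound active",
}


def blocking_tracks(done: Set[str]) -> List[str]:
    """Return the prerequisite tracks that are not finished yet, in order."""
    return [track for track in PREREQUISITES if track not in done]


def can_fire_track6(done: Set[str]) -> bool:
    """TRACK 6 may fire only when no prerequisite is outstanding."""
    return not blocking_tracks(done)
```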

## Honest current numbers (do not display as final)

```
MBPP 100         A  ~? / B  13 / C  ?
HumanEval 164    A  ~? / B   6 / C  ?
LiveCodeBench 50 A  ~? / B   0 / C  ?
Anchor (BD6)            19/19 (organ alone, current decoder)
```

The re-run for the consolidated table happens when TRACK 6 fires.

## File pointers

* `tools/bench/mbpp_he_3mode.py` — MBPP/HE A/B/C runner (live)
* `tools/bench/livecodebench_3mode.py` — LCB A/B/C runner (live)
* `tools/bench/triz_organ_bench.py` — BD7 TRIZ runner (live)
* `tools/bench/` — directory for additions (gpqa, swe-lite, bfcl runners pending)
* `reports/MBPP_HE_3MODE_V1.{md,json}` — last live MBPP/HE result
* `reports/LIVECODEBENCH_3MODE_V1.{md,json}` — last live LCB result
* `reports/BD7_TRIZ_BASELINE_T2_N100.json` — TRIZ pre-surgery baseline (0/100)