
RUNTIME_ORGANISM_BENCH v1 — TRACK 2 + 4 first integration (2026-05-02)


Mission: prove that Black-Dog conductance is read before chain selection (TRACK 2) and that critic_lite + wound run before 7B fallback on json/code/ariz failures (TRACK 4). 30 mixed ARIZ tasks, 5 repeated runs. DAG visible per row.

Result

| run | pass | fb | repair_attempt | repair_rescue | chain_picks (organ_only / organ_repair / organ_7b) |
|-----|------|----|----------------|---------------|-----------------------------------------------------|
| 0 | 26/30 | 0 | 0 | 0 | 30 / 0 / 0 ← cold start (all 0 conductance) |
| 1 | 26/30 | 0 | 4 | 0 | 26 / 4 / 0 ← 4 failures from r0 routed to repair |
| 2 | 26/30 | 0 | 0 | 0 | 26 / 0 / 4 ← repair didn't rescue → routed to 7b |
| 3 | 26/30 | 0 | 0 | 0 | 30 / 0 / 0 ← all chains -1 → tiebreak picks cheapest |
| 4 | 26/30 | 0 | 4 | 0 | 26 / 4 / 0 ← cycle restarts |

TRACK 2 DOD audit

| spec gate | status |
|--------------------------------------------------|--------|
| router reads conductance before selecting organs | ✅ harness loads BD store + harness-local store, queries each candidate chain |
| DAG records selected_chain | ✅ each row writes selected_chain |
| DAG records candidate_conductances | ✅ each row writes candidate_conductances: {local: [...], cpp: [...]} |
| successful chains get food | ✅ harness local_update(food=1) after each pass |
| failed chains get poison | ✅ local_update(poison=1) on fail |
| route choice changes on repeated tasks | ✅ ARIZ/01 trajectory: organ_only → organ_repair → organ_7b → organ_only → organ_repair |

TRACK 2 → MET.
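The gates above can be illustrated with a minimal Python sketch. It is not the harness code. It assumes that conductance is a net integer score (food minus poison) per (task, chain), that the router picks the highest score, and that ties go to the cheapest chain in the order organ_only, organ_repair, organ_7b. `LocalStore`, `pick_chain`, and `run_task` are hypothetical names, and the DAG row is simplified to a single store rather than the `{local: [...], cpp: [...]}` shape the real rows use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Chains listed cheapest first. The cost ordering is an assumption.
CHAINS = ("organ_only", "organ_repair", "organ_7b")


@dataclass
class LocalStore:
    """Hypothetical harness-local conductance store keyed by (task, chain)."""
    net: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def conductance(self, task: str, chain: str) -> int:
        # Unseen (task, chain) pairs start at 0, which is the cold-start state.
        return self.net.get((task, chain), 0)

    def local_update(self, task: str, chain: str, food: int = 0, poison: int = 0) -> None:
        # Food raises the score, poison lowers it.
        self.net[(task, chain)] = self.conductance(task, chain) + food - poison


def pick_chain(store: LocalStore, task: str) -> Tuple[str, Dict[str, int]]:
    """Read every candidate's conductance, then pick the best (cheapest on ties)."""
    conds = {chain: store.conductance(task, chain) for chain in CHAINS}
    best = max(conds.values())
    selected = next(chain for chain in CHAINS if conds[chain] == best)
    return selected, conds


def run_task(store: LocalStore, task: str, passes: Callable[[str, str], bool]) -> dict:
    """Run one task once and return a simplified DAG row."""
    selected, conds = pick_chain(store, task)  # conductance is read before selection
    ok = passes(task, selected)
    if ok:
        store.local_update(task, selected, food=1)
    else:
        store.local_update(task, selected, poison=1)
    return {
        "task": task,
        "selected_chain": selected,
        "candidate_conductances": conds,
        "pass": ok,
    }


if __name__ == "__main__":
    store = LocalStore()
    # A task that fails on every chain, repeated over 5 runs.
    for run in range(5):
        row = run_task(store, "ARIZ/01", lambda task, chain: False)
        print(run, row["selected_chain"], row["candidate_conductances"])
```

Under these assumptions, a task that fails on every chain walks through organ_only, organ_repair, organ_7b, organ_only, organ_repair over five runs, which matches the ARIZ/01 trajectory in the audit table.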

TRACK 4 DOD audit

| spec gate | status |
|------------------------------------------------------|--------|
| critic + wound runs BEFORE 7b on json/code/ariz fail | ✅ harness invokes phys05_critic_lite + phys05_wound when organ_only fails |
| every repair attempt writes BD food/poison | ✅ via local store update |
| ≥3 non-terminal repairs rescued without 7B | ❌ 0 rescues |

TRACK 4 → MECHANISM WIRED but RESCUE COUNT = 0.

Why repair rescue = 0 (honest)

The two organs at the repair stations were not trained for the ARIZ 6-field JSON schema:

failures (stderr/exit-code style).

for terminal recovery, not strict-JSON repair.

Both fired (4 attempts in run1, 4 in run4 — visible in repair_attempt column) but produced outputs that did not pass the ARIZ verifier.

This is a training-side blocker, not a wiring blocker. TRACK 4 mechanism is operational (verifier fail → critic call → wound call → re-verify → 7B fallback only after wound fails). To turn rescue rate positive on json/code/ariz routes, BD8 surgery is required:

set (failed schema diagnosis).

This is a queued follow-up task. Loop architecture works; rescue quality is the next surgery target.
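The repair ordering described in this section (verifier fail, critic call, wound call, re-verify, 7B fallback only after wound fails) can be sketched in Python as follows. This is an illustration under stated assumptions, not the harness implementation: the organs are passed in as plain callables, `run_with_repair` and `verify` are hypothetical names, and the six field names are placeholders because the report does not list the real ARIZ schema fields.

```python
import json
from typing import Callable, List, Tuple

# Placeholder names only: the real ARIZ schema has 6 fields,
# but their names are not given in this report.
REQUIRED_FIELDS = ("field_1", "field_2", "field_3", "field_4", "field_5", "field_6")


def verify(output: str) -> bool:
    """Pass only if the output is a JSON object containing every required field."""
    try:
        obj = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(obj, dict) and all(name in obj for name in REQUIRED_FIELDS)


def run_with_repair(
    task: str,
    organ: Callable[[str], str],
    critic: Callable[[str, str], str],
    wound: Callable[[str, str, str], str],
    fallback_7b: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Return (final_output, stations_visited) for one task."""
    stations = ["organ"]
    output = organ(task)
    if verify(output):
        return output, stations

    # Verifier failed: the critic diagnoses, then wound attempts a repair.
    stations.append("critic")
    diagnosis = critic(task, output)
    stations.append("wound")
    repaired = wound(task, output, diagnosis)
    if verify(repaired):
        return repaired, stations  # rescued without the 7B model

    # Only reached when the repaired output still fails the verifier.
    stations.append("7b")
    return fallback_7b(task), stations


if __name__ == "__main__":
    good = json.dumps({name: "x" for name in REQUIRED_FIELDS})
    _, stations = run_with_repair(
        "demo",
        organ=lambda task: "{broken",
        critic=lambda task, out: "missing fields",
        wound=lambda task, out, diag: "{still broken",
        fallback_7b=lambda task: good,
    )
    print(stations)
```

In this sketch a rescue is any run whose station list ends at wound. The result reported here corresponds to the other branch: both repair organs fire, the re-verify fails, and control passes to the fallback.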

Persistent failure pattern

4 of 30 ARIZ tasks fail repeatedly across all chains (ARIZ/01, /04, /05, and one more — visible in trajectory dump). These are tasks where the trained TRIZ organ + 7B + repair organs all produce schema-incomplete outputs. The trajectory bounces:

ARIZ/01: r0 organ_only/FAIL → r1 organ_repair/FAIL → r2 organ_7b/FAIL
        → r3 organ_only/FAIL  ← all conductances are -1; tiebreak picks
                                 organ_only (cheapest); cycle restarts
        → r4 organ_repair/FAIL

The harness's tiebreak rule (prefer cheapest when all conductances equal) means after every chain fails once, the next run starts the cycle over. Smarter behaviour would be: when ALL chains have conductance < 0, mark the task as "outside system reach" and skip further attempts. That's a v0.2 refinement.
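A minimal sketch of that refinement, assuming conductances arrive as a mapping from chain name to score and that the cost order is supplied by the caller. `pick_chain_or_skip` and its `threshold` parameter are hypothetical names, not part of the current harness.

```python
from typing import Mapping, Optional, Sequence


def pick_chain_or_skip(
    conductances: Mapping[str, float],
    cost_order: Sequence[str],
    threshold: float = 0.0,
) -> Optional[str]:
    """Pick the best chain, or return None when every chain is below threshold.

    cost_order lists chains cheapest first, so ties resolve to the cheapest chain.
    """
    best = max(conductances[chain] for chain in cost_order)
    if best < threshold:
        # Every chain has already been poisoned: treat the task as
        # outside system reach instead of restarting the cycle.
        return None
    return next(chain for chain in cost_order if conductances[chain] == best)


if __name__ == "__main__":
    order = ("organ_only", "organ_repair", "organ_7b")
    # Cold start: all zero, so the cheapest chain is picked.
    print(pick_chain_or_skip({"organ_only": 0, "organ_repair": 0, "organ_7b": 0}, order))
    # State of ARIZ/01 before run 3: every chain at -1, so the task is skipped.
    print(pick_chain_or_skip({"organ_only": -1, "organ_repair": -1, "organ_7b": -1}, order))
```

With a threshold of 0, the four persistent failures would be skipped from run 3 onward rather than returning to organ_only. A task skipped this way is never retried unless its conductances are reset, so retrained organs would need some re-entry rule, which this sketch does not cover.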

Honest TRACK 2+4 status

TRACK 2 (conductance router) — v0.1 SHIPPED. Conductance is read, arbitration happens, route selection changes on repeated tasks. DAG proves it. Implementation is in Python harness (not C++ runtime), so the v0.2 follow-up is to port the same arbitration logic into run_chat_ariz_organ_first and friends. ~2-3 hours C++.

TRACK 4 (critic + wound for non-terminal routes) — v0.1 WIRED but RESCUE = 0. The mechanism (call critic, call wound, re-verify before 7B fallback) is operational and DAG-visible. Zero rescues happened because the organs at the repair stations are trained for terminal failures, not ARIZ JSON. Surgery to retrain critic_lite + wound on json/code/ariz failure samples is a queued follow-up (BD8).

Files

arbitration + critic+wound rescue path

Next concrete work

  1. BD8 critic+wound surgery for ARIZ schema repair. Teacher-student: collect 50-100 (broken_json, fixed_json) pairs from existing TRIZ failures, train phys05_critic_lite_v2 + phys05_wound_v2, gate on rescue rate ≥ 30 % on this same harness.

  2. Port Python arbitrator → C++ runtime. pick_chain(route, pattern_hash, candidates) helper called from run_chat_ariz_organ_first (and code/json equivalents).

  3. Smarter "outside system reach" tiebreak: if max(conductance) < threshold, mark task as system-failure rather than re-cycling chains. Saves wall time on hopeless tasks.
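As a sketch of how the ≥ 30 % rescue-rate gate from the first item could be checked, the snippet below computes the rate from the repair_attempt and repair_rescue columns of the Result table. It assumes the rate is defined as total rescues divided by total repair attempts across runs, which the report does not state explicitly. `rescue_rate` and `gate_met` are hypothetical helper names.

```python
from typing import Dict, List

# Per-run counts copied from the Result table of this report.
RUNS: List[Dict[str, int]] = [
    {"run": 0, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 1, "repair_attempt": 4, "repair_rescue": 0},
    {"run": 2, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 3, "repair_attempt": 0, "repair_rescue": 0},
    {"run": 4, "repair_attempt": 4, "repair_rescue": 0},
]


def rescue_rate(runs: List[Dict[str, int]]) -> float:
    """Rescues divided by attempts, pooled over all runs (0.0 if no attempts)."""
    attempts = sum(row["repair_attempt"] for row in runs)
    rescues = sum(row["repair_rescue"] for row in runs)
    return rescues / attempts if attempts else 0.0


def gate_met(runs: List[Dict[str, int]], min_rate: float = 0.30) -> bool:
    """True when the pooled rescue rate reaches the target rate."""
    return rescue_rate(runs) >= min_rate


if __name__ == "__main__":
    print(rescue_rate(RUNS), gate_met(RUNS))
```

On the v1 numbers this gives 0 rescues out of 8 attempts, so the gate is not met. With the same 8 attempts, 3 rescues would clear a 30 % gate.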