X100_SCOREBOARD — sprint progress (one truth file)
Single file. New numbers only: no philosophy, no tied placeholders. Every axis is updated when its proving bench produces a score.
RULE (rebased 2026-04-30): sha256(input) exact-input cache hits are utility hygiene, NOT a strategic x100 advantage. A result counts as real x100 only when all of the following hold: the input is non-identical, the system extracts a form pattern + delta params, it replays an action template, the verifier passes, and the model is NOT called.
| Axis | current | target | status |
|------|---------|--------|--------|
| Raw decode (Q4 7B) | 41.7 tok/s (DP4A) | 100+ tok/s | YELLOW |
| llama.cpp raw | 80+ tok/s | 100+ tok/s | YELLOW |
| MBPP (n=100, native) | B=13/100 (production), A=60, C=60. BD6.2 union/4ep regressed to 6, reverted. BD6.3 anchor-gate 0/19, rejected at gate. | B ≥ 25 first | YELLOW: production stable at 13; BD6.4 needs anchor-positive curriculum |
| HumanEval (n=164, full) | B=6/164 (production), A=81, C=81. Same BD6.2/BD6.3 history. | B ≥ 20 first | YELLOW: production stable at 6; BD6.4 next |
| LCB easy (n=50) | B=0/50 (post-route-fix), organs={phys05_code_skeleton}, fb=0. Pre-fix was routing artefact (triz). | B ≥ 3 after BD6.4 | YELLOW: route fixed; data needs competitive-programming refs |
| Anchor (BD6 pass-1 wins) | 19/19 verified on production pack | gate must stay 19/19 before any pack flip | GREEN |
| Terminal NanoOS V1 (10) | PARROT 7 / MONSTER 8 stable; 9 in best-of-N (+1 / +2) | MONSTER ≥ 9 | YELLOW |
| Terminal NanoOS V2 (30) | PARROT 20 / MONSTER 22 (+2), wall ratio 2.35× | MONSTER − PARROT ≥ +8 | YELLOW |
| Exact replay cache | 5/5 exact repeats: ~2ms warm, 140-403× speedup | utility cache hygiene | GREEN (UTILITY) |
| Holographic form replay | 20/20 non-identical variants, all <100ms, all model_called:false | ≥15/20 non-identical variants pass with no 7B call | GREEN |
| Organ farm liveness | 9/9 organs fired with food=1; 5 dead organs revived; phys05_wound born and live | every --chat DAG carries multi-organ chain | YELLOW (alive but not yet routed in --chat) |
| Black-Dog reinforcement | 9/9 organs have BD-moved conductance (0.000 → 0.20-0.67 in 5 prompts each) | every --chat DAG has real food/poison/cond_before/cond_after; conductance influences routing | GREEN (signal layer) / YELLOW (router not yet using cond) |
| ARIZ/TRIZ organ-first chain | smoke green: triz fail → 7B-chat synth pass; verifier "TC+PC both filled" | 100 ARIZ task probe; triz solo pass-rate ≥40%; chain pass-rate ≥85% | YELLOW (architecturally GREEN, model-quality YELLOW) |
| MONSTER_INTEGRATION_V1 | 8/8 routes landed, 6/6 active non-cache routes have BD signal in DAG (json_repair fall-through closed 2026-05-01) | every route lands + has BD signal | GREEN (architecturally one body, every lane proven) |
| Organ baseline probe | 45/45 probes ok=true; 100 % pass-rate at sane_nonempty verifier; per-organ BD curves in reports/ORGAN_BASELINE_PROBE.md | per-organ STRICT verifier (json_strict / TC-PC / etc.); BD4 work | GREEN |
| Prefix-cache retry | not implemented | MBPP wall <100s @ ≥73 | RED |
| Runtime-owned schemas | not implemented | tokens −40 % | RED |
| SWE micro 10 | none | MONSTER − PARROT ≥ +4 | RED |
| Memory 350 lookup | smoke only | <1ms, 100/100 | RED |
| Self-repair runtime | partial (templates) | 3 auto repairs | RED |
| Official bench queue | partial | one status doc | YELLOW |
Last updated: 2026-05-01. Owner: agent (auto-update after each numbered TASK).
BENCH_CLEANUP_AND_OFFICIAL_RUN status
- TASK 1 (json_repair fall-through fix) — DONE 2026-05-01.
Route 8/8, BD signal on 6/6 active non-cache routes. See reports/MONSTER_INTEGRATION_V1.md.
- TASK 2 (MBPP/HE × A/B/C) — HARNESS READY.
NO_7B_FALLBACK env gate landed in run_chat_organ_route and in the run_chat HumanEval branch. Bench: tools/bench/mbpp_he_3mode.py.
Mode A = MONSTER_FORCE_7B (or external llama-server when reachable).
Mode B = ORGAN_FIRST=1 + NO_7B_FALLBACK=1 (raw 0.5B only).
Mode C = ORGAN_FIRST=1 + MONSTER_NATIVE_RETRY=1 (organ + 7B fallback).
- TASK 3 (LiveCodeBench / BFCL official / GPQA Diamond) — pending.
- TASK 4 (single unified table) — pending.
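The A/B/C modes from TASK 2 can be sketched as an environment mapping. This is a minimal illustration, not the contents of tools/bench/mbpp_he_3mode.py: the flag names come from the note above, while `MODE_ENV`, `env_for_mode`, `may_call_7b`, and the value "1" for MONSTER_FORCE_7B are assumptions made for the example.

```python
import os

# Hypothetical mapping from bench mode to environment flags.
# Flag names are taken from the TASK 2 note; the "1" value for
# MONSTER_FORCE_7B is an assumption.
MODE_ENV = {
    "A": {"MONSTER_FORCE_7B": "1"},
    "B": {"ORGAN_FIRST": "1", "NO_7B_FALLBACK": "1"},
    "C": {"ORGAN_FIRST": "1", "MONSTER_NATIVE_RETRY": "1"},
}

# Every flag that any mode may set, so stale ones can be cleared.
ALL_FLAGS = sorted({flag for env in MODE_ENV.values() for flag in env})


def env_for_mode(mode, base=None):
    """Return a copy of the environment with exactly one mode's flags set."""
    if mode not in MODE_ENV:
        raise ValueError(f"unknown mode: {mode!r}")
    env = dict(os.environ if base is None else base)
    for flag in ALL_FLAGS:
        env.pop(flag, None)  # drop flags left over from another mode
    env.update(MODE_ENV[mode])
    return env


def may_call_7b(env):
    """Hypothetical gate: 7B is allowed unless NO_7B_FALLBACK=1 is set."""
    return env.get("NO_7B_FALLBACK") != "1"
```

Clearing all known flags before applying a mode keeps a Mode B run from silently inheriting MONSTER_NATIVE_RETRY from a previous Mode C shell, which would blur the B versus C comparison.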
Why exact-cache is GREEN-UTILITY but NOT an x100 advantage
HOLOGRAM_REPLAY_X100 (5/5 workflows, 140-403× speedup) was renamed to EXACT_REPLAY_CACHE_V1. It proves a memoization layer: same exact input → same exact output, no model call. Useful for any workflow that genuinely repeats verbatim. But it is not strategic intelligence: two prompts that differ in a single character bypass it entirely.
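A minimal sketch of such a memoization layer, assuming the cache key is the sha256 of the raw prompt bytes as the RULE describes. The class `ExactReplayCache`, the function `answer`, and the `model` callable are hypothetical names for illustration, not the production code.

```python
import hashlib


class ExactReplayCache:
    """In-memory exact-input cache keyed by sha256(prompt)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        return self._store.get(self.key(prompt))

    def put(self, prompt, output):
        self._store[self.key(prompt)] = output


def answer(prompt, cache, model):
    """Return (output, model_called). The model runs only on a cache miss."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, False
    output = model(prompt)
    cache.put(prompt, output)
    return output, True
```

Because the key is a hash of the exact bytes, "list files" and "list files " (one trailing space) map to unrelated keys, so the second prompt misses and the model is called again.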
Real x100 advantage requires the system to recognize the FORM of a task across surface variations and replay the action template with new parameters. That is HOLOGRAPHIC_FORM_REPLAY_V1, which is the next sprint axis below.
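The form-replay idea can be sketched as follows, under loud assumptions: the "rename a file" form, the regex `FORM`, the template `ACTION_TEMPLATE`, and the functions `verify` and `try_form_replay` are all invented for illustration and are not the HOLOGRAPHIC_FORM_REPLAY_V1 implementation. The sketch only shows the shape of the rule: match a form on non-identical input, extract delta params, replay a template, verify, and never call the model.

```python
import re
import shlex

# Hypothetical form: "rename/move <src> to <dst>" in a few surface variants.
FORM = re.compile(
    r"^\s*(?:please\s+)?(?:rename|move)\s+(?P<src>\S+)\s+"
    r"(?:to|as|->)\s+(?P<dst>\S+?)[.!]?\s*$",
    re.IGNORECASE,
)

# Hypothetical action template replayed with the extracted params.
ACTION_TEMPLATE = "mv -- {src} {dst}"


def verify(action, params):
    """Check that the rendered action round-trips to the extracted params."""
    expected = ["mv", "--", params["src"], params["dst"]]
    return shlex.split(action) == expected and params["src"] != params["dst"]


def try_form_replay(prompt):
    """Return a replay record, or None so the caller can fall back to the model."""
    match = FORM.match(prompt)
    if match is None:
        return None  # form not recognized
    params = match.groupdict()
    action = ACTION_TEMPLATE.format(
        src=shlex.quote(params["src"]), dst=shlex.quote(params["dst"])
    )
    if not verify(action, params):
        return None  # verifier rejected the replay
    return {"action": action, "params": params, "model_called": False}
```

With this sketch, "rename a.txt to b.txt" and "Please move notes.md as old.md." hit the same form with different params, while an unrelated prompt returns None. A variant would count toward the ≥15/20 target only when a record comes back with model_called set to False.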
Sprint order (no choice; do in order)
- ✅ PHASE_12_TR_HEREDOC_AWARE → Terminal V1 +1 stable / +2 best
- ✅ TERMINAL_NANOOS_30 → +2 (target +8 model-class gated)
- ✅ EXACT_REPLAY_CACHE_V1 → utility hygiene GREEN, not x100
- ⏳ HOLOGRAPHIC_FORM_REPLAY_V1 → real x100: ≥15/20 form variants
- PREFIX_CACHE_RETRY → MBPP wall <100s
- RUNTIME_OWNED_OUTPUTS → tokens −40 %
- SWE_MICRO_CAPSULE_V1 → +4 vs PARROT
- MEMORY_350_PROOF_V1 → exact lookup <1ms
- SELF_REPAIR_RUNTIME_V1 → 3 auto repairs
- OFFICIAL_BENCH_QUEUE_STATUS → status doc
A pass without a new number does not count.