SOVEREIGN_COGNITION_GAUNTLET_V1
Big Tech-style coding bench × repeat-learning axis.
_6 problems × 10 rounds × 2 backends. Same Physarium-7B Q4 weights via llama.cpp. PARROT = pure HTTP call. MONSTER_LEARNING = full --chat runtime with scroll injection and admin-self-seed-on-fail._
Pass-rate per round
| round | PARROT | MONSTER_LEARNING |
|---|---|---|
| 1 | 6/6 | 5/6 |
| 2 | 6/6 | 6/6 |
| 3 | 6/6 | 6/6 |
| 4 | 6/6 | 6/6 |
| 5 | 6/6 | 6/6 |
| 6 | 6/6 | 6/6 |
| 7 | 6/6 | 6/6 |
| 8 | 6/6 | 6/6 |
| 9 | 6/6 | 6/6 |
| 10 | 6/6 | 6/6 |
PARROT total: 60/60 = 100%

MONSTER_LEARNING total: 59/60 = 98.3%

Δ (MONSTER_LEARNING minus PARROT): -1 pass (-1.7 pp)
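The totals and the delta can be recomputed from the per-round table. The snippet below is a minimal sketch, not the gauntlet's own reporting code; the names `summarize`, `parrot`, and `monster` are illustrative, and the per-round counts are copied from the table above.

```python
# Per-round passes (out of 6 problems), copied from the pass-rate table.
# Variable and function names here are illustrative, not from the gauntlet code.
PROBLEMS_PER_ROUND = 6
parrot = [6] * 10
monster = [5] + [6] * 9


def summarize(passes, per_round=PROBLEMS_PER_ROUND):
    """Return (total passes, total attempts, pass rate in percent)."""
    total = sum(passes)
    attempts = per_round * len(passes)
    return total, attempts, 100.0 * total / attempts


p_total, p_attempts, p_pct = summarize(parrot)
m_total, m_attempts, m_pct = summarize(monster)

# Delta is MONSTER_LEARNING minus PARROT, in passes and in percentage points.
delta_passes = m_total - p_total
delta_pp = m_pct - p_pct

print(f"PARROT total: {p_total}/{p_attempts} = {p_pct:.1f}%")
print(f"MONSTER_LEARNING total: {m_total}/{m_attempts} = {m_pct:.1f}%")
print(f"Delta: {delta_passes:+d} passes ({delta_pp:+.1f} pp)")
```

Keeping one decimal place shows that the gap is about 1.7 percentage points, which rounds to 2 pp.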
Per-problem first-pass round
| problem | PARROT first-pass | MONSTER first-pass | seed injected after |
|---|---|---|---|
| is_prime | 1 | 1 (post-seed pass=—) | — |
| fizzbuzz_list | 1 | 1 (post-seed pass=—) | — |
| roman_to_int | 1 | 1 (post-seed pass=—) | — |
| is_balanced | 1 | 1 (post-seed pass=—) | — |
| count_unique_chars | 1 | 2 (post-seed pass=9/9) | — |
| flatten_nested | 1 | 1 (post-seed pass=—) | — |
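As a rough illustration of how the two per-problem figures could be derived, the sketch below computes a first-pass round and a post-seed pass count from a per-round pass/fail history. The helper names are hypothetical. The history for `count_unique_chars` is inferred from the tables (one failure in round 1, passes afterwards), and the seed round of 1 is an assumption, since the table does not record it.

```python
from typing import List, Optional, Tuple


def first_pass_round(history: List[bool]) -> Optional[int]:
    """1-based index of the first passing round, or None if it never passes."""
    for round_no, passed in enumerate(history, start=1):
        if passed:
            return round_no
    return None


def post_seed_pass(history: List[bool], seed_after_round: int) -> Tuple[int, int]:
    """(passes, attempts) over the rounds that follow the seed injection."""
    later = history[seed_after_round:]
    return sum(later), len(later)


# Inferred MONSTER_LEARNING history for count_unique_chars: fail, then 9 passes.
history = [False] + [True] * 9

# ASSUMPTION: the seed was injected after round 1 (not stated in the table).
assumed_seed_round = 1

print(first_pass_round(history))
print(post_seed_pass(history, assumed_seed_round))
```

Under that assumption the helpers reproduce the table entry for `count_unique_chars`: a first pass in round 2 and 9 of 9 passes after the seed.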
DOD
RED — MONSTER worse than PARROT.
Why this differs from HumanEval / MBPP / SWE-bench
Standard benches run at temp=0 produce a single pass/fail result per problem. Our gauntlet runs the same problem 10× and asks: if the system fails, can it be made to learn? When PARROT fails at temp=0, it gives the same wrong answer all 10 times, which shows up as a flat fail line. MONSTER_LEARNING gets an admin-written exemplar between rounds (the "self-LoRA-with-admin-rights" loop in cheap form), so its pass curve can rise after a failure. A standard bench cannot express this question.
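The repeat-learning idea can be sketched with a toy simulation. This is not the gauntlet's runtime: `toy_model`, `run_gauntlet`, and the `scroll` dictionary are hypothetical stand-ins, and the "model" is a deterministic stub that only answers correctly once an exemplar has been seeded. It illustrates the shape of the two curves in the failure case, not any measured result.

```python
from typing import Dict, List


def toy_model(problem: str, scroll: Dict[str, str]) -> str:
    """Deterministic stand-in for a temp=0 model (hypothetical).

    It returns the seeded exemplar if one exists, else a fixed wrong answer.
    """
    return scroll.get(problem, "wrong")


def run_gauntlet(problems: Dict[str, str], rounds: int, learn: bool) -> List[int]:
    """Run every problem once per round and return passes per round.

    learn=False mimics a stateless backend: nothing carries over.
    learn=True mimics seed-on-fail: a failed problem gets an exemplar
    that is visible from the next round onward.
    """
    scroll: Dict[str, str] = {}
    passes_per_round: List[int] = []
    for _ in range(rounds):
        passed = 0
        for name, expected in problems.items():
            answer = toy_model(name, scroll if learn else {})
            if answer == expected:
                passed += 1
            elif learn:
                # Stand-in for the admin-written exemplar seeded after a fail.
                scroll[name] = expected
        passes_per_round.append(passed)
    return passes_per_round


problems = {"hard_problem": "right"}
stateless_curve = run_gauntlet(problems, rounds=10, learn=False)
learning_curve = run_gauntlet(problems, rounds=10, learn=True)
print(stateless_curve)
print(learning_curve)
```

In this toy setup the stateless run stays flat at zero, while the seeded run fails once and then passes, which is the contrast a single-shot bench has no way to record.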
_Raw: reports/SOVEREIGN_COGNITION_GAUNTLET_V1.json_