TERMINAL_NANOOS_MINI_V1
Phase-12 NanoOS Capsule Substrate × Terminal-Bench-style suite (10 tasks). PARROT = one llama.cpp call -> extract bash -> ONE capsule run. MONSTER = envelope --chat -> C++ runtime drives k=1..3 stderr-feedback retry via tools/capsule/shell_capsule.py. Same capsule + verifier per task across both modes.
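The "extract bash" step in PARROT can be pictured with the sketch below. It is a minimal illustration, not the project's extractor: the helper name `extract_bash`, the accepted fence labels, and the fallback of treating an unfenced reply as the script are all assumptions.

```python
import re

# Build the fence marker programmatically so this example can itself live in a fenced block.
FENCE = "`" * 3

# Hypothetical pattern: a fence, an optional bash/sh/shell label, then the script body.
FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:bash|sh|shell)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_bash(reply: str) -> str:
    """Return the first fenced shell block in a model reply, stripped of whitespace."""
    match = FENCE_RE.search(reply)
    if match:
        return match.group(1).strip()
    # Assumption: with no fence present, the whole reply is treated as the script.
    return reply.strip()


if __name__ == "__main__":
    reply = "Here you go:\n" + FENCE + "bash\necho demo > out.txt\n" + FENCE + "\nDone."
    print(extract_bash(reply))
```

Whatever this step returns is the only script PARROT runs, so an extraction miss counts directly as a task failure in that mode.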
Both modes record an Evidence object at dag/capsules/cap_*.json containing capsule_id, exit codes per command, sha256 of every artifact file, and a replay_recipe (the full spec to re-execute deterministically). MONSTER's evidence is the survivor of the feedback loop; PARROT's is one-shot.
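As a rough illustration of what such an Evidence object could hold, the sketch below runs a list of commands, records their exit codes, hashes every file left in the working directory, and embeds the spec needed to re-run it. Only `capsule_id`, `replay_recipe`, and `spec_inline` are names taken from this report; the `exit_codes` and `artifacts` keys, the layout inside `spec_inline`, and the helpers `run_capsule` and `replay_matches` are assumptions, and the real schema in `dag/capsules/cap_*.json` may differ.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex sha256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_capsule(capsule_id: str, commands: list[str], workdir: Path) -> dict:
    """Run each command in workdir and return an Evidence-like dict (hypothetical layout)."""
    exit_codes = []
    for cmd in commands:
        proc = subprocess.run(cmd, shell=True, cwd=workdir, capture_output=True, text=True)
        exit_codes.append(proc.returncode)
    # Hash every regular file left behind, keyed by path relative to the working directory.
    artifacts = {
        str(p.relative_to(workdir)): sha256_file(p)
        for p in sorted(workdir.rglob("*"))
        if p.is_file()
    }
    return {
        "capsule_id": capsule_id,
        "exit_codes": exit_codes,
        "artifacts": artifacts,
        "replay_recipe": {"spec_inline": {"commands": commands}},
    }


def replay_matches(evidence: dict) -> bool:
    """Re-run the inline spec in a fresh directory and compare exit codes and hashes."""
    spec = evidence["replay_recipe"]["spec_inline"]
    with tempfile.TemporaryDirectory() as tmp:
        again = run_capsule(evidence["capsule_id"], spec["commands"], Path(tmp))
    return (
        again["exit_codes"] == evidence["exit_codes"]
        and again["artifacts"] == evidence["artifacts"]
    )
```

A replay only reproduces the same hashes when the commands themselves are deterministic; a command that writes a timestamp, for example, would make `replay_matches` return `False` even though nothing is wrong with the capsule.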
| mode | pass | total | rate | wall | VWS |
|---|---|---|---|---|---|
| PARROT | 7 | 10 | 70 % | 3.1s | 2.26 |
| MONSTER | 8 | 10 | 80 % | 7.5s | 1.07 |
Δ MONSTER − PARROT: +1
Per-task
| task | diff | PARROT | MONSTER | rounds (M) | capsule_id (M) | artifacts (M) |
|---|---|---|---|---|---|---|
| create_file_exact | easy | OK | OK | 1 | 30_184916_dca840 | a948904f2f0f479b |
| run_python_print_42 | easy | OK | OK | 1 | 30_184916_91e4c3 | b2101d5826aa11de |
| fix_failing_test | hard | OK | X | 3 | 30_184918_311478 | d9ae93c739f49feb, e1a894022d1a0829, c24be9d84c1462cf |
| parse_json | easy | OK | OK | 1 | 30_184918_43072a | 36bd4ed657a57ace |
| sed_transform | medium | X | OK | 2 | 30_184919_1376a2 | ac914dfa543e017c, ac914dfa543e017c |
| compile_cpp_missing_include | hard | X | OK | 2 | 30_184920_4f9f8e | 9d648401c7b6fb42, 3433be63e64588d1 |
| chmod_run_executable | medium | OK | OK | 1 | 30_184921_5243b5 | f2e35109e0f7bfa8 |
| find_bug_from_stderr | hard | X | X | 3 | 30_184922_950abf | ede1fb2f0d44a773 |
| produce_patch | medium | OK | OK | 1 | 30_184922_24179c | 470450d1505afcca, 4fdbc441ea7b5461, 636b5bae55f8b988 |
| verify_output_hash | medium | OK | OK | 1 | 30_184923_d28b3d | 5fc4ae2d24d613fb, 8bd186b55ecb5d98 |
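The net Δ of +1 hides movement in both directions, which a quick tally of the per-task table makes visible. The snippet below transcribes the PARROT and MONSTER columns from the table above; the names `ROWS` and `tally` are illustrative only.

```python
# (task, PARROT passed, MONSTER passed, MONSTER rounds), transcribed from the per-task table.
ROWS = [
    ("create_file_exact", True, True, 1),
    ("run_python_print_42", True, True, 1),
    ("fix_failing_test", True, False, 3),
    ("parse_json", True, True, 1),
    ("sed_transform", False, True, 2),
    ("compile_cpp_missing_include", False, True, 2),
    ("chmod_run_executable", True, True, 1),
    ("find_bug_from_stderr", False, False, 3),
    ("produce_patch", True, True, 1),
    ("verify_output_hash", True, True, 1),
]


def tally(rows):
    """Return pass counts per mode plus the tasks that changed outcome between modes."""
    parrot = sum(1 for _, p, _, _ in rows if p)
    monster = sum(1 for _, _, m, _ in rows if m)
    rescued = [t for t, p, m, _ in rows if m and not p]    # failed one-shot, passed with retry
    regressed = [t for t, p, m, _ in rows if p and not m]  # passed one-shot, failed with retry
    return {
        "parrot": parrot,
        "monster": monster,
        "delta": monster - parrot,
        "rescued": rescued,
        "regressed": regressed,
    }


if __name__ == "__main__":
    print(tally(ROWS))
```

On these rows the retry loop rescues two tasks (`sed_transform`, `compile_cpp_missing_include`) and loses one that PARROT passed (`fix_failing_test`), which nets out to the reported +1.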
Architectural witness
- PARROT pass rows still produce a capsule + DAG entry — the difference is that MONSTER's evidence is what survived a retry, while PARROT's is the single shot.
- Every MONSTER pass row has a non-empty `capsule_id` and ≥1 artifact sha256. Replay any of them by running the capsule with the spec dumped inside `replay_recipe.spec_inline`.
- For MONSTER rows where `rounds > 1`, the C++ runtime fed stderr+exit codes from the k-1 capsule into the next prompt; that is the execution evidence the doctrine demands.
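The stderr-feedback retry described in the bullets above can be sketched in a few lines. This is a Python illustration of the control flow only: the report states that the actual loop lives in the C++ runtime, and the names `feedback_loop`, `ask_model`, and `verify`, the prompt wording, and the choice to let the verifier alone decide a pass are assumptions.

```python
import subprocess
from pathlib import Path
from typing import Callable

MAX_ROUNDS = 3  # k = 1..3, as in the MONSTER description


def feedback_loop(
    task: str,
    ask_model: Callable[[str], str],   # hypothetical: prompt in, shell script out
    verify: Callable[[Path], bool],    # hypothetical: per-task verifier over the workdir
    workdir: Path,
) -> dict:
    """Run up to MAX_ROUNDS attempts, feeding stderr and the exit code into the next prompt."""
    prompt = task
    for k in range(1, MAX_ROUNDS + 1):
        script = ask_model(prompt)
        # Assumption: the script is run through the system shell in the capsule directory.
        proc = subprocess.run(script, shell=True, cwd=workdir, capture_output=True, text=True)
        if verify(workdir):
            return {"passed": True, "rounds": k}
        # Round k failed: build the next prompt from the task plus the execution evidence.
        prompt = (
            f"{task}\n\n"
            f"Previous attempt (round {k}) exited with code {proc.returncode}.\n"
            f"stderr:\n{proc.stderr}"
        )
    return {"passed": False, "rounds": MAX_ROUNDS}
```

With `MAX_ROUNDS = 1` the same function degenerates to the one-shot PARROT behaviour, which is one way to keep the capsule and verifier identical across both modes.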
Tasks (10)
- `create_file_exact` (easy): Create a file named hello.txt in the current directory whose contents are exactly the line `hello wo…
- `run_python_print_42` (easy): Write a python script `solve.py` that prints exactly the integer 42 and run it with python3. Use she…
- `fix_failing_test` (hard): There are two files: `math_lib.py` and `run_tests.py`. Run the tests with `python3 run_tests.py`. Th…
- `parse_json` (easy): There is a file `data.json` in the working dir. Use python3 to read it and print the value of the ke…
- `sed_transform` (medium): There is a file `data.txt` with three lines like `key: N`. Produce a file `out.txt` where every key…
- `compile_cpp_missing_include` (hard): There is a file `prog.cpp` in the working dir. Compile it with `g++ prog.cpp -o prog`. Then run `./p…
- `chmod_run_executable` (medium): There is a file `script.sh` in the working dir. Make it executable and run it (`./script.sh`). Its out…
- `find_bug_from_stderr` (hard): Run `python3 app.py`. The output must contain the literal token OK_PARSED. If the program raises an…
- `produce_patch` (medium): There are two files `before.txt` and `after.txt`. Produce a unified diff and save it as `patch.diff`
- `verify_output_hash` (medium): There is a file `payload.txt` in the working dir. Compute its sha256 hash with `sha256sum payload.tx…
DOD
- GREEN — MONSTER 8/10 > PARROT 7/10; execution layer earned its place by Δ +1.