BD7 Phase 2 — progress (2026-05-02)
Done in this turn
- ✅ Prompt template rewritten: organs/prompts/phys05_triz_contradiction.txt now requires the strict 6-field schema (technical_contradiction, physical_contradiction, ifr, resources, triz_operators, candidate_moves). The old 5-field schema (improves/worsens/resource_hints) is gone.
- ✅ T1 baseline (8 tasks, fixed prompt) = 0/8 (1 organ leak).
- ✅ 92 additional ARIZ tasks curated → 100 total in data/organ_surgery/phys05_triz_contradiction/ariz_tasks_v1.jsonl. Domains covered: aerospace, automotive, manufacturing, materials, process_eng, energy, mechanical, medical, electronics, optics, marine, civil, construction, hvac, robotics, communications, consumer, agriculture, packaging, military, mining, railway, leisure.
- ✅ T2 baseline (100 tasks, fixed prompt) = 0/100, organ_leaks=2, fallback_count=0.
- ✅ Per-row failure category counts (out of 100):

  | Failure category | Rows |
  | --- | --- |
  | technical_contradiction_missing_or_empty | 77 |
  | physical_contradiction_missing_or_empty | 77 |
  | ifr_missing_or_empty | 77 |
  | resources_empty | 77 |
  | triz_operators_empty | 77 |
  | candidate_moves_empty | 76 |
  | no_json (output didn't parse at all) | 23 |

- ✅ Sample raw outputs confirm the 0.5B organ emits JSON-shaped output 77% of the time, but with hallucinated keys like technical_contradictions, physical_consituencies, irregularities, irf (typo), condition_conds. None match the required schema.
- ✅ C++ now wires a separate triz pack path (PHYS05_TRIZ_PACK) so BD7 surgery will not disturb code_skeleton's anchor 19/19. It currently falls back to the same .planck file (identical behaviour) until BD7 produces a triz-trained pack.
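The per-row failure categories above can be reproduced with a small verifier. The following is a minimal sketch, not the actual verifier used for the T2 report: it assumes the whole raw output is expected to be a single JSON object, that the first three schema fields are strings, and that the last three are lists. Only the category names and field names are taken from this report.

```python
import json
from collections import Counter

# Field names come from the strict 6-field schema in this report.
# Which fields are strings and which are lists is an ASSUMPTION.
STRING_FIELDS = ("technical_contradiction", "physical_contradiction", "ifr")
LIST_FIELDS = ("resources", "triz_operators", "candidate_moves")


def categorize(raw: str) -> list[str]:
    """Return failure categories for one raw organ output (empty list = pass)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["no_json"]
    if not isinstance(obj, dict):
        # Valid JSON that is not an object is treated like unparseable output.
        return ["no_json"]
    failures = []
    for name in STRING_FIELDS:
        value = obj.get(name)
        if not isinstance(value, str) or not value.strip():
            failures.append(f"{name}_missing_or_empty")
    for name in LIST_FIELDS:
        value = obj.get(name)
        if not isinstance(value, list) or len(value) == 0:
            failures.append(f"{name}_empty")
    return failures


def tally(raw_outputs: list[str]) -> Counter:
    """Count how many rows hit each failure category."""
    counts = Counter()
    for raw in raw_outputs:
        counts.update(categorize(raw))
    return counts
```

Under these assumptions, an output that uses hallucinated keys such as irf parses as JSON but fails all six field checks, which matches the pattern of 77 JSON-shaped rows and 23 no_json rows.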
Reports written
- reports/BD7_TRIZ_BASELINE_T0.json: 0/8, old prompt
- reports/BD7_TRIZ_BASELINE_T1_PROMPT_FIXED.json: 0/8, new prompt
- reports/BD7_TRIZ_BASELINE_T2_N100.json: 0/100, new prompt, 100 tasks
- reports/BD7_PHASE2_PROGRESS.md: this file
Frozen state
production: PHYS05_PACK = physarum05b_code_skeleton.planck
PHYS05_TRIZ_PACK = physarum05b_code_skeleton.planck (fallback)
prompt md5 (new) : (re-hash after edit)
organ spec : rep=1.15, ngram=0, cuda_rep=1.08, max_tokens=160
code_skeleton bench : MBPP B 13/100, HE B 6/164, anchor 19/19 (unchanged)
triz organ-only T2 : 0/100 (the gap)
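The fallback behaviour of the two pack settings can be illustrated as follows. This is only a sketch of the resolution logic in Python: the real wiring is in C++, and treating PHYS05_PACK and PHYS05_TRIZ_PACK as environment variables is an assumption made for illustration. The function name resolve_packs is hypothetical.

```python
import os

# Production pack name as listed in the frozen state.
DEFAULT_PACK = "physarum05b_code_skeleton.planck"


def resolve_packs(env=None) -> dict:
    """Resolve both pack paths; the triz pack falls back to the code pack.

    ASSUMPTION: settings are read from environment variables. The actual
    C++ implementation may read them from a different source.
    """
    env = os.environ if env is None else env
    code_pack = env.get("PHYS05_PACK", DEFAULT_PACK)
    # Until BD7 produces a triz-trained pack, an unset or empty
    # PHYS05_TRIZ_PACK resolves to the same file as PHYS05_PACK.
    triz_pack = env.get("PHYS05_TRIZ_PACK") or code_pack
    return {"PHYS05_PACK": code_pack, "PHYS05_TRIZ_PACK": triz_pack}
```

Keeping the two settings separate means a later triz-trained pack can be swapped in without touching the pack that backs the code_skeleton anchor.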
Phase 2 remaining
Step 5 of the user spec is to build the poison dataset. Each row needs:
- failed output (have — 100 captured in T2 report)
- verifier reason (have — categorized above)
- ideal target ← THIS IS THE BLOCKER
For BD6 we had anchor_positive.jsonl captured by running the production pack ITSELF on 19 prompts (model already passed those prompts). For BD7 the model passes ZERO of the 100 prompts, so we cannot extract ideal targets from it. Source for the 100 ideal TRIZ analyses (TC/PC/IFR/resources/operators/candidate_moves per task) must be specified.
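To make the blocker concrete, here is a minimal sketch of what one poison-dataset row could look like. The class name PoisonRow, the task_id field, and the JSONL layout are hypothetical; only the three required parts (failed output, verifier reason, ideal target) come from the list above.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PoisonRow:
    task_id: str                          # hypothetical identifier for the ARIZ task
    failed_output: str                    # raw organ output captured in the T2 run
    verifier_reasons: list                # failure categories for this row
    ideal_target: Optional[dict] = None   # the missing piece for BD7

    def is_complete(self) -> bool:
        """A row is usable for training only when all three parts exist."""
        return (
            bool(self.failed_output)
            and bool(self.verifier_reasons)
            and self.ideal_target is not None
        )


def to_jsonl(rows: list) -> str:
    """Serialize only complete rows, one JSON object per line."""
    return "\n".join(
        json.dumps(asdict(row), ensure_ascii=False)
        for row in rows
        if row.is_complete()
    )
```

With the current state, every row would have failed_output and verifier_reasons filled and ideal_target left as None, so to_jsonl would emit nothing.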
Three viable sources, with tradeoffs:
A. Hand-curate from classical TRIZ literature
- Most rigorous, fully manual.
- Estimated effort: 5-10 hours of focused writing for 100 high-quality TRIZ analyses. Cannot fit in one agent turn.
- Pure offline.
B. Use Physarium-7B (top brain) as offline teacher
- Spawn --chat against the 7B with each of the 100 tasks + ARIZ_KERNEL.md schema in the prompt. Capture outputs as candidate ideal targets.
- Then gate-validate each (JSON parse + 6 fields non-empty + reasonable length). Reject any that fail; backfill with hand-curated targets.
- Same pattern as BD6 anchor_positive capture (teacher = production).
- Estimated effort: 30 min compute + 1 hour curation review.
- The user spec said "no 7B-generated synthetic bulk", but that was in the context of TASK definitions. For TARGETS (training labels), this is the canonical offline teacher pattern.
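The capture-then-gate loop for option B could be sketched as below. The teacher is passed in as a plain callable because how the 7B --chat session is driven is not specified here. The task keys id and prompt, the function names, and the length bounds are all assumptions chosen for illustration.

```python
import json
from typing import Callable

REQUIRED = (
    "technical_contradiction", "physical_contradiction", "ifr",
    "resources", "triz_operators", "candidate_moves",
)
# ASSUMED bounds for "reasonable length"; tune against real teacher outputs.
MIN_CHARS, MAX_CHARS = 100, 4000


def _non_empty(value) -> bool:
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return False


def gate(raw: str) -> bool:
    """Accept a teacher output only if it parses, has all 6 fields, and is sane in length."""
    if not (MIN_CHARS <= len(raw) <= MAX_CHARS):
        return False
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(_non_empty(obj.get(key)) for key in REQUIRED)


def capture_targets(tasks: list, teacher: Callable[[str], str]):
    """Return (accepted targets by task id, task ids needing hand-curated backfill)."""
    accepted, backfill = {}, []
    for task in tasks:
        raw = teacher(task["prompt"])  # hypothetical task keys: "id", "prompt"
        if gate(raw):
            accepted[task["id"]] = json.loads(raw)
        else:
            backfill.append(task["id"])
    return accepted, backfill
```

The backfill list is what would be handed to manual curation, and the 10-20 sample spot-check would draw from the accepted dictionary.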
C. Hybrid: 30 hand-curated + 70 from 7B
- Smallest viable training set ≈ 30 strong hand-curated targets (covers core TRIZ-40 operators).
- 70 extra rows from 7B teacher to provide volume for QLoRA.
- The 30 hand-curated rows alone might be too few for an r=8 LoRA to generalize.
Recommendation
Option B, with strict gate-validation plus a spot-check of 10-20 samples by hand. The same structural pattern worked for BD6. If the 7B teacher quality is bad, fall back to C.
Awaiting user GO
Confirm one of:
- GO B (use 7B as offline teacher, capture 100 ideal targets)
- GO C (hybrid 30 hand + 70 7B-teacher)
- GO A (full hand-curate — multi-turn project)
After ideal targets exist, the rest of the pipeline (poison.jsonl build, QLoRA train via existing trainer machinery, gate, separate-pack merge) is mostly mechanical and should be 1-2 turns of automation.