BD7 Phase 2 — progress (2026-05-02)

Done in this turn

✅ Prompt template rewritten: organs/prompts/phys05_triz_contradiction.txt now requires the strict 6-field schema (technical_contradiction, physical_contradiction, ifr, resources, triz_operators, candidate_moves). The old 5-field schema (improves/worsens/resource_hints) is gone.

✅ T1 baseline (8 tasks, fixed prompt) = 0/8 (1 organ leak).
✅ 92 additional ARIZ tasks curated → 100 total in data/organ_surgery/phys05_triz_contradiction/ariz_tasks_v1.jsonl. Domains covered: aerospace, automotive, manufacturing, materials, process_eng, energy, mechanical, medical, electronics, optics, marine, civil, construction, hvac, robotics, communications, consumer, agriculture, packaging, military, mining, railway, leisure.

✅ T2 baseline (100 tasks, fixed prompt) = 0/100, organ_leaks=2, fallback_count=0.

✅ Per-row failure category counts (out of 100):

    technical_contradiction_missing_or_empty   77
    physical_contradiction_missing_or_empty    77
    ifr_missing_or_empty                       77
    resources_empty                            77
    triz_operators_empty                       77
    candidate_moves_empty                      76
    no_json (output didn't parse at all)       23

✅ Sample raw outputs confirm that the 0.5B organ emits a JSON shape 77% of the time, but with hallucinated keys such as technical_contradictions, physical_consituencies, irregularities, irf (typo), condition_conds. None match the required schema.
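The category counts above can be reproduced with a small classifier over the raw outputs. What follows is a minimal sketch, not the actual verifier: the category names and field names are taken from this report, but the split into string fields and list fields, and the assumption that each raw output is parsed as a whole with `json.loads`, are guesses.

```python
import json
from collections import Counter

# Field names come from the strict 6-field schema in this report.
# ASSUMPTION: the first three are strings, the last three are lists.
STRING_FIELDS = ("technical_contradiction", "physical_contradiction", "ifr")
LIST_FIELDS = ("resources", "triz_operators", "candidate_moves")


def categorize(raw: str) -> list[str]:
    """Return the failure categories for one raw output (empty list = pass)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return ["no_json"]
    if not isinstance(obj, dict):
        return ["no_json"]

    reasons = []
    for key in STRING_FIELDS:
        value = obj.get(key)
        if not isinstance(value, str) or not value.strip():
            reasons.append(f"{key}_missing_or_empty")
    for key in LIST_FIELDS:
        value = obj.get(key)
        if not isinstance(value, list) or len(value) == 0:
            reasons.append(f"{key}_empty")
    return reasons


def tally(raw_outputs) -> Counter:
    """Count how many outputs fall into each failure category."""
    counts = Counter()
    for raw in raw_outputs:
        counts.update(categorize(raw))
    return counts
```

Under this sketch a hallucinated key such as technical_contradictions counts as a missing technical_contradiction, which is consistent with outputs that parse as JSON but still fail nearly every field check.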

✅ C++ now wires a separate TRIZ pack path (PHYS05_TRIZ_PACK), so BD7 surgery will not disturb code_skeleton's anchor 19/19. It currently falls back to the same .planck file (identical behaviour) until BD7 produces a triz-trained pack.
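The fallback behaviour can be illustrated as follows. This is only a Python sketch of the resolution logic, not the C++ wiring itself, and it assumes PHYS05_PACK and PHYS05_TRIZ_PACK are read like environment variables; the real code may use compile-time constants or a config file instead. The function name `resolve_packs` is hypothetical.

```python
import os

# Pack file name as listed in the frozen state of this report.
DEFAULT_PACK = "physarum05b_code_skeleton.planck"


def resolve_packs(env=None) -> dict:
    """Resolve both pack paths; the TRIZ pack falls back to the code pack.

    ASSUMPTION: settings are looked up in a mapping such as os.environ.
    """
    env = os.environ if env is None else env
    code_pack = env.get("PHYS05_PACK") or DEFAULT_PACK
    # Until a triz-trained pack exists, both organs load the same file.
    triz_pack = env.get("PHYS05_TRIZ_PACK") or code_pack
    return {"code_skeleton": code_pack, "triz": triz_pack}
```

The point of the separation is that setting the TRIZ entry later changes only the `triz` value, leaving the `code_skeleton` path untouched.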

Reports written

reports/BD7_TRIZ_BASELINE_T0.json — 0/8, old prompt
reports/BD7_TRIZ_BASELINE_T1_PROMPT_FIXED.json — 0/8, new prompt
reports/BD7_TRIZ_BASELINE_T2_N100.json — 0/100, new prompt, 100 tasks
reports/BD7_PHASE2_PROGRESS.md — this file

Frozen state

production:  PHYS05_PACK         = physarum05b_code_skeleton.planck
             PHYS05_TRIZ_PACK    = physarum05b_code_skeleton.planck (fallback)
             prompt md5 (new)    : (re-hash after edit)
             organ spec          : rep=1.15, ngram=0, cuda_rep=1.08, max_tokens=160
             code_skeleton bench : MBPP B 13/100, HE B 6/164, anchor 19/19  (unchanged)
             triz organ-only T2  : 0/100  (the gap)

Phase 2 remaining

Step 5 of the user spec is to build the poison dataset. Each row needs:

failed output (have — 100 captured in T2 report)
verifier reason (have — categorized above)
ideal target ← THIS IS THE BLOCKER

For BD6 we had anchor_positive.jsonl, captured by running the production pack itself on 19 prompts (the model already passed those prompts). For BD7 the model passes zero of the 100 prompts, so we cannot extract ideal targets from it. The source for the 100 ideal TRIZ analyses (TC/PC/IFR/resources/operators/candidate_moves per task) must be specified.
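The three-part row and the blocker can be made concrete with a small record type. This is a hedged sketch: the class name `PoisonRow`, the `task_id` field, and the JSONL layout are hypothetical and are not the format of the existing trainer machinery.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PoisonRow:
    task_id: str                        # hypothetical identifier for the ARIZ task
    failed_output: str                  # raw organ output captured in the T2 report
    verifier_reasons: list[str]         # e.g. ["no_json"] or ["ifr_missing_or_empty", ...]
    ideal_target: Optional[dict] = None # the missing piece: a correct 6-field analysis

    def is_complete(self) -> bool:
        return self.ideal_target is not None


def to_jsonl(rows) -> str:
    """Serialize only the rows that already have an ideal target."""
    lines = [
        json.dumps(asdict(row), ensure_ascii=False)
        for row in rows
        if row.is_complete()
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

With the current state of BD7, every row would have `ideal_target=None`, so `to_jsonl` would emit an empty string. That is the blocker expressed in code.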

Three viable sources, with tradeoffs:

A. Hand-curate from classical TRIZ literature

Most rigorous, fully manual.
Estimated effort: 5-10 hours of focused writing for 100 high-quality TRIZ analyses. Cannot fit in one agent turn.

Pure offline.

B. Use Physarium-7B (top brain) as offline teacher

Spawn --chat against the 7B with each of the 100 tasks + ARIZ_KERNEL.md schema in the prompt. Capture outputs as candidate ideal targets.

Then gate-validate each (JSON parse + 6-field non-empty + reasonable length). Reject any that fail; backfill with hand-curated targets.

Same pattern as BD6 anchor_positive capture (teacher = production).
Estimated effort: 30 min compute + 1 hour curation review.
The user spec said "no 7B-generated synthetic bulk", but that was in the context of TASK definitions. For TARGETS (training labels), this is the canonical offline-teacher pattern.
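The gate-validation step in option B could look roughly like this. It is a sketch under stated assumptions: the length bounds (`min_chars`, `max_chars`) are placeholder values for "reasonable length", and the function names `gate` and `split_candidates` are hypothetical.

```python
import json

REQUIRED_FIELDS = (
    "technical_contradiction", "physical_contradiction", "ifr",
    "resources", "triz_operators", "candidate_moves",
)


def gate(raw: str, min_chars: int = 40, max_chars: int = 4000):
    """Return (accepted, reason) for one teacher output.

    ASSUMPTION: the length bounds are illustrative, not tuned values.
    """
    if not (min_chars <= len(raw) <= max_chars):
        return False, "length_out_of_range"
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, "no_json"
    if not isinstance(obj, dict):
        return False, "no_json"
    for key in REQUIRED_FIELDS:
        if not obj.get(key):  # missing, None, "" and [] all fail
            return False, f"{key}_missing_or_empty"
    return True, "ok"


def split_candidates(candidates: dict):
    """Split {task_id: raw_output} into accepted targets and a backfill list."""
    accepted, backfill = {}, []
    for task_id, raw in candidates.items():
        ok, _reason = gate(raw)
        if ok:
            accepted[task_id] = json.loads(raw)
        else:
            backfill.append(task_id)
    return accepted, backfill
```

The `backfill` list is the set of task ids that would need a hand-curated target, which gives a direct measure of how far option B drifts toward option C.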

C. Hybrid — 30 hand-curate + 70 from 7B

Smallest viable training set ≈ 30 strong hand-curated targets (covers core TRIZ-40 operators).

70 extra rows from 7B teacher to provide volume for QLoRA.
The 30 hand-curated rows alone might be too few for an r=8 LoRA to generalize.

Recommendation

Option B, with strict gate-validation plus a manual spot-check of 10-20 samples. The same structural pattern worked for BD6. If 7B teacher quality is bad, fall back to C.
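For the manual spot-check, a seeded sample keeps the reviewed subset reproducible across turns. A minimal sketch, assuming task ids are strings; the function name and the default of 15 (inside the 10-20 range) are illustrative choices.

```python
import random


def pick_spot_check(task_ids, k: int = 15, seed: int = 0) -> list:
    """Pick a reproducible random subset of task ids for manual review."""
    ids = sorted(task_ids)      # sort first so the result does not depend on input order
    rng = random.Random(seed)   # private RNG; does not touch global random state
    return rng.sample(ids, min(k, len(ids)))
```

Sorting before sampling means the same seed selects the same tasks even if the accepted targets are loaded in a different order.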

Awaiting user GO

Confirm one of:

GO B (use 7B as offline teacher, capture 100 ideal targets)
GO C (hybrid 30 hand + 70 7B-teacher)
GO A (full hand-curate — multi-turn project)

After ideal targets exist, the rest of the pipeline (poison.jsonl build, QLoRA train via existing trainer machinery, gate, separate-pack merge) is mostly mechanical and should be 1-2 turns of automation.