# BD7 Phase 2 — progress (2026-05-02)

## Done in this turn

* ✅ Prompt template rewritten: `organs/prompts/phys05_triz_contradiction.txt`
  now requires the strict 6-field schema (technical_contradiction,
  physical_contradiction, ifr, resources, triz_operators, candidate_moves).
  The old 5-field schema (improves/worsens/resource_hints) has been removed.
* ✅ T1 baseline (8 tasks, fixed prompt) = **0/8** (1 organ leak).
* ✅ 92 additional ARIZ tasks curated → 100 total in
  `data/organ_surgery/phys05_triz_contradiction/ariz_tasks_v1.jsonl`.
  Domains covered: aerospace, automotive, manufacturing, materials,
  process_eng, energy, mechanical, medical, electronics, optics,
  marine, civil, construction, hvac, robotics, communications,
  consumer, agriculture, packaging, military, mining, railway, leisure.
* ✅ T2 baseline (100 tasks, fixed prompt) = **0/100**, organ_leaks=2,
  fallback_count=0.
* ✅ Per-row failure category counts (out of 100; a row can fall into
  several categories, so the counts sum to more than 100):
  * `technical_contradiction_missing_or_empty` 77
  * `physical_contradiction_missing_or_empty`  77
  * `ifr_missing_or_empty`                     77
  * `resources_empty`                          77
  * `triz_operators_empty`                     77
  * `candidate_moves_empty`                    76
  * `no_json` (output didn't parse at all)     23
* ✅ Sample raw outputs confirm that the 0.5B organ emits JSON-shaped
  output 77% of the time, but with **hallucinated keys** such as
  `technical_contradictions`, `physical_consituencies`,
  `irregularities`, `irf` (a misspelling of `ifr`), and
  `condition_conds`. None of them match the required schema.
* ✅ C++ now wires a separate TRIZ pack path (`PHYS05_TRIZ_PACK`) so that
  BD7 surgery will not disturb code_skeleton's anchor 19/19. It currently
  falls back to the same .planck file (identical behaviour) until BD7
  produces a TRIZ-trained pack.
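
The failure categories listed above can be reproduced with a small verifier. The following is only a sketch, not the actual BD7 gate: it assumes that the first three fields are strings and the last three are lists (inferred from the `_missing_or_empty` vs `_empty` suffixes), and the helper names `extract_json` and `failure_categories` are hypothetical.

```python
import json

# The six required fields named in the rewritten prompt template.
# ASSUMPTION: the first three are strings, the last three are lists.
STRING_FIELDS = ("technical_contradiction", "physical_contradiction", "ifr")
LIST_FIELDS = ("resources", "triz_operators", "candidate_moves")


def extract_json(raw):
    """Return the first JSON object embedded in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    return None


def failure_categories(raw):
    """Map one raw organ output to a list of failure categories (empty = pass)."""
    obj = extract_json(raw)
    if obj is None:
        return ["no_json"]
    failures = []
    for key in STRING_FIELDS:
        value = obj.get(key)
        if not isinstance(value, str) or not value.strip():
            failures.append(f"{key}_missing_or_empty")
    for key in LIST_FIELDS:
        value = obj.get(key)
        if not isinstance(value, list) or not value:
            failures.append(f"{key}_empty")
    return failures
```

Under these assumptions, an output that parses as JSON but uses only hallucinated keys (for example `irf` instead of `ifr`) fails all six field checks at once, which is consistent with five categories sharing the same count of 77.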

## Reports written

* `reports/BD7_TRIZ_BASELINE_T0.json` — 0/8, old prompt
* `reports/BD7_TRIZ_BASELINE_T1_PROMPT_FIXED.json` — 0/8, new prompt
* `reports/BD7_TRIZ_BASELINE_T2_N100.json` — 0/100, new prompt, 100 tasks
* `reports/BD7_PHASE2_PROGRESS.md` — this file

## Frozen state

```
production:  PHYS05_PACK         = physarum05b_code_skeleton.planck
             PHYS05_TRIZ_PACK    = physarum05b_code_skeleton.planck (fallback)
             prompt md5 (new)    : (re-hash after edit)
             organ spec          : rep=1.15, ngram=0, cuda_rep=1.08, max_tokens=160
             code_skeleton bench : MBPP B 13/100, HE B 6/164, anchor 19/19  (unchanged)
             triz organ-only T2  : 0/100  (the gap)
```
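
The prompt md5 above is still a placeholder. A minimal sketch for re-hashing the prompt file after the edit, using only the Python standard library; the helper name `file_md5` is hypothetical and not part of the existing tooling.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example call (path taken from this report):
#   file_md5("organs/prompts/phys05_triz_contradiction.txt")
```

The resulting digest would replace the `(re-hash after edit)` placeholder in the frozen-state block.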

## Phase 2 remaining

Step 5 of the user spec is **build poison dataset**. Each row needs:
* failed output  (have — 100 captured in T2 report)
* verifier reason (have — categorized above)
* **ideal target**  ← THIS IS THE BLOCKER

For BD6 we had `anchor_positive.jsonl`, captured by running the
production pack itself on 19 prompts that the model already passed.
For BD7 the model passes none of the 100 prompts, so ideal targets
cannot be extracted from it. A source for the 100 ideal TRIZ analyses
(TC/PC/IFR/resources/operators/candidate_moves per task) must be
specified.
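
To make the three per-row ingredients concrete, here is a rough sketch of how one poison row might be assembled once ideal targets exist. The row field names (`task_id`, `rejected`, `verifier_reasons`, `chosen`) and both helper functions are assumptions for illustration; the real `poison.jsonl` layout is whatever the existing trainer machinery expects.

```python
import json


def build_poison_row(task_id, failed_output, verifier_reasons, ideal_target):
    """Assemble one poison-dataset row; all field names are hypothetical."""
    if ideal_target is None:
        # This is the current blocker: no ideal target, no row.
        raise ValueError(f"task {task_id}: no ideal target available yet")
    return {
        "task_id": task_id,
        "rejected": failed_output,                  # raw organ output from T2
        "verifier_reasons": list(verifier_reasons),  # failure categories
        "chosen": json.dumps(ideal_target, ensure_ascii=False),
    }


def write_jsonl(rows, path):
    """Write rows as JSON Lines, one object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Raising on a missing target keeps half-built rows out of the dataset instead of silently training on empty labels.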

Three viable sources, with tradeoffs:

### A. Hand-curate from classical TRIZ literature
* Most rigorous, fully manual.
* Estimated effort: 5-10 hours of focused writing for 100 high-quality
  TRIZ analyses. Cannot fit in one agent turn.
* Pure offline.

### B. Use Physarium-7B (top brain) as offline teacher
* Spawn `--chat` against the 7B with each of the 100 tasks +
  ARIZ_KERNEL.md schema in the prompt. Capture outputs as candidate
  ideal targets.
* Then gate-validate each output (JSON parses, all 6 fields non-empty,
  length reasonable). Reject any that fail and backfill with
  hand-curated targets.
* Same pattern as the BD6 anchor_positive capture (teacher = production).
* Estimated effort: 30 min compute + 1 hour curation review.
* The user spec said "no 7B-generated synthetic bulk", but that was in
  the context of TASK definitions. For TARGETS (training labels) this is
  the canonical offline teacher pattern.
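
The capture-then-gate loop for Option B could look roughly like the sketch below. Several things here are assumptions: `teacher` is any callable that wraps the 7B `--chat` invocation and returns its raw text, tasks are dicts with hypothetical `id` and `prompt` keys, and `MAX_CHARS` is a placeholder for whatever "length reasonable" ends up meaning.

```python
import json

REQUIRED_FIELDS = (
    "technical_contradiction", "physical_contradiction", "ifr",
    "resources", "triz_operators", "candidate_moves",
)
MAX_CHARS = 4000  # ASSUMPTION: placeholder bound for "length reasonable"


def passes_gate(raw):
    """JSON parses, all six fields are present and non-empty, length bounded."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if any(not obj.get(key) for key in REQUIRED_FIELDS):
        return False
    return len(raw) <= MAX_CHARS


def capture_targets(tasks, teacher):
    """Run the teacher on every task; split into accepted targets and backfill ids."""
    accepted, backfill = {}, []
    for task in tasks:
        raw = teacher(task["prompt"])
        if passes_gate(raw):
            accepted[task["id"]] = json.loads(raw)
        else:
            backfill.append(task["id"])  # to be hand-curated later
    return accepted, backfill
```

Passing the teacher in as a callable keeps the gate testable offline with a stub, without touching the 7B.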

### C. Hybrid — 30 hand-curate + 70 from 7B
* Smallest viable training set ≈ 30 strong hand-curated targets
  (covers core TRIZ-40 operators).
* 70 extra rows from 7B teacher to provide volume for QLoRA.
* The 30 hand-curated rows alone might be too few for an r=8 LoRA to
  generalize.

## Recommendation

**Option B**, with strict gate-validation plus a hand spot-check of
10-20 samples. The same structural pattern worked for BD6. If the 7B
teacher's quality is bad, fall back to C.
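
For the spot-check, a fixed-seed draw keeps the reviewed subset reproducible across turns. This is a small sketch only; the function name, the default of 15 samples, and the seed are all arbitrary choices.

```python
import random


def spot_check_sample(task_ids, k=15, seed=0):
    """Pick k task ids for manual review, deterministically for a given seed."""
    rng = random.Random(seed)
    ids = sorted(task_ids)  # sort first so input order does not change the draw
    return rng.sample(ids, min(k, len(ids)))
```

Calling it on the ids of the gate-accepted targets yields the same review set every time, so a later turn can re-open exactly the rows that were inspected.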

## Awaiting user GO

Confirm one of:
* GO B  (use 7B as offline teacher, capture 100 ideal targets)
* GO C  (hybrid 30 hand + 70 7B-teacher)
* GO A  (full hand-curate — multi-turn project)

Once ideal targets exist, the rest of the pipeline (poison.jsonl
build, QLoRA train via existing trainer machinery, gate, separate-pack
merge) is mostly mechanical and should take 1-2 turns of automation.
