# BD7 — teacher retry v2 (2026-05-02)

**TL;DR: retrying the 30 v1 losers with 3 prompt variants each (90 prompts)
recovered 6, raising total winners from 70 to 76/100. 24 still fail: 13 with
score=0 (no JSON from any variant), 10 with score=5 (one field short, an easy
hand-fix), and 1 with score=3. A spot-check of 10 random v1 winners shows the
schema is solid and the content mostly defensible (8/10 strong, 2/10 weak).**

## Numbers

| stage          | wall time   | new winners | cumulative |
|----------------|-------------|-------------|------------|
| forge v1       | 102 min     | 70/100      | 70/100     |
| retry v2 (3 variants) | 54 min | +6 / 30     | **76/100** |

The strict-JSON gate from the spec is 90/100. We are at **76/100**, 14 short.
Closing that gap needs either many more teacher retry rounds (about an hour
each) or hand-fixes.

## 24 still-loser breakdown

| best_score | count | tasks | next-step cost |
|------------|-------|-------|----------------|
| 5 / 6      | 10    | ARIZ/02, /06, /32, /34, /68, /72, /81, /86, /93, /94 | hand-fill 1 field per row, ~15 min |
| 3 / 6      | 1     | ARIZ/43 | hand-fill 3 fields, ~3 min |
| 0 / 6      | 13    | ARIZ/01, /03, /07, /18, /30, /41, /48, /61, /67, /77, /79, /88, /99 | full hand-write, ~30-40 min |

The 10 score=5 rows are the cheapest path: for each row, the v2 retry
candidate already has 5 of 6 fields, so only the missing one needs to be added.
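
To find which field each score=5 row is missing, something like the following could be run over the retry candidates. This is a minimal sketch, not the scoring used by the forge scripts: the six key names are hypothetical placeholders (the report only specifies TC/PC/IFR plus three arrays), and the "one point per non-empty field" rule is an assumption.

```python
import json

# Hypothetical key names: the report specifies TC/PC/IFR plus three arrays,
# but not the exact keys. Replace with the schema the forge scripts use.
SCALAR_FIELDS = ("tc", "pc", "ifr")
ARRAY_FIELDS = ("resources", "operators", "moves")


def missing_fields(candidate_text):
    """Return the schema fields that are absent or empty.

    Unparseable or non-object text counts as missing all six fields.
    """
    all_fields = list(SCALAR_FIELDS + ARRAY_FIELDS)
    try:
        obj = json.loads(candidate_text)
    except json.JSONDecodeError:
        return all_fields
    if not isinstance(obj, dict):
        return all_fields
    missing = []
    for key in SCALAR_FIELDS:
        value = obj.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    for key in ARRAY_FIELDS:
        value = obj.get(key)
        if not isinstance(value, list) or not value:
            missing.append(key)
    return missing


def score(candidate_text):
    """Assumed scoring: one point per present, non-empty field (max 6)."""
    return 6 - len(missing_fields(candidate_text))
```

Under these assumptions, a score=5 candidate yields a one-element list from `missing_fields`, which is the field to hand-fill.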

## Spot-check 10 random v1 winners

Random sample (seed=42): ARIZ/22, /10, /49, /44, /39, /25, /21, /57, /64, /82.

**Strong (8/10):**
* ARIZ/22 X-ray, ARIZ/10 Li-ion, ARIZ/49 loudspeaker, ARIZ/39 glass beaker,
  ARIZ/25 ship hull, ARIZ/21 solar panel, ARIZ/57 emergency siren,
  ARIZ/64 solar oven — clear TC/PC/IFR, concrete moves, plausible operators.

**Weak (2/10):**
* ARIZ/44 water pump: the TC direction is inverted. It states "speed decreases",
  whereas the goal is to raise both flow and speed. The PC is inverted too.
* ARIZ/82 satellite array: the IFR is generic filler
  ("...without compromising either function"). The other 5 fields are fine.

**Operator-name slip:** The 7B sometimes writes the wrong TRIZ-40 number for
the right name (e.g. "46 Composite materials", whereas "Composite materials"
is TRIZ-40 op 40 and the list only runs to 40). Mild concern: the 0.5B might
learn the wrong number. But the schema structure is intact, which is what
surgery is teaching.
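
If the number slips turn out to matter, they could be flagged mechanically. The sketch below assumes operator labels take the `<number> <name>` form seen in the example above, and carries only a two-entry subset of the 40 principles for illustration; `check_operator` is a hypothetical helper, not part of the existing tools.

```python
import re

# Illustrative subset of the classical 40 inventive principles.
# A real check would carry the full 1..40 table.
TRIZ_40_SUBSET = {
    1: "Segmentation",
    40: "Composite materials",
}
NAME_TO_NUMBER = {name.lower(): num for num, name in TRIZ_40_SUBSET.items()}


def check_operator(label):
    """Classify an operator label such as '40 Composite materials'.

    Returns 'ok', 'wrong_number', or 'unknown' (name not in the table,
    or label not in '<number> <name>' form).
    """
    match = re.fullmatch(r"\s*(\d+)[\s.:-]*(.+?)\s*", label)
    if not match:
        return "unknown"
    number = int(match.group(1))
    name = match.group(2).lower()
    expected = NAME_TO_NUMBER.get(name)
    if expected is None:
        return "unknown"
    return "ok" if expected == number else "wrong_number"
```

Rows flagged `wrong_number` could then be corrected by looking up the name, since the name is the part the 7B tends to get right.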

**Verdict:** content quality is good enough for surgery to learn the
JSON schema and engineering-shape vocabulary. The 0.5B will not learn
TRIZ-40 perfectly anyway — what matters is that it learns to emit
6-field JSON with TC/PC/IFR + 3 arrays of plausible engineering items.

## Files written

* `data/organ_surgery/phys05_triz_contradiction/teacher_targets_v2.jsonl`
  (76 winners, ordered by task_id)
* `data/organ_surgery/phys05_triz_contradiction/teacher_targets_v2_losers.json`
  (24 still-losers with best_score / variant)
* `data/organ_surgery/phys05_triz_contradiction/teacher_candidates_raw.json`
  (200 v1 candidates — full set for reference)
* `tools/surgery/build_triz_teacher_targets.py` — v1 forge
* `tools/surgery/build_triz_teacher_retry.py` — v2 retry, 3-variant
* `reports/BD7_TEACHER_RETRY_V2.md` — this file
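
A quick sanity check of the winners file could look like the sketch below. It assumes each JSONL record carries a `task_id` key (a hypothetical key name, inferred from "ordered by task_id") and that ordering means plain string sort order; `check_targets` is an illustrative helper, not an existing tool.

```python
import json


def check_targets(lines, expected_count=76):
    """Parse JSONL lines and verify row count, ordering, and uniqueness.

    Assumes every record has a 'task_id' key (hypothetical key name).
    Returns a list of problem descriptions; an empty list means all checks pass.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    task_ids = [rec["task_id"] for rec in records]
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} rows, found {len(records)}")
    if task_ids != sorted(task_ids):
        problems.append("rows are not ordered by task_id")
    if len(set(task_ids)) != len(task_ids):
        problems.append("duplicate task_id values")
    return problems
```

It accepts any iterable of lines, so an open file handle for `teacher_targets_v2.jsonl` can be passed directly.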

## Three viable next steps

1. **A: Hand-fix the 10 score=5 losers** (~15 min). Brings the total to
   86/100, still 4 short of the 90/100 gate.
2. **B: Full hand-curate of all 24 losers** (~45-50 min). Brings the total to
   100/100.
3. **C: Proceed to QLoRA with 76** (no more curation). A smaller but
   workable training set. Eval on all 100 will probably show that the LoRA
   handles the 76 trained domains well and stumbles on the 24 untrained ones.
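
The arithmetic behind the three options, checked against the 90/100 gate, can be written out as a short sketch. All counts come from the tables above; the variable names are illustrative.

```python
GATE = 90                      # strict-JSON gate from the spec, out of 100
CURRENT = 76                   # winners after retry v2
LOSERS = {5: 10, 3: 1, 0: 13}  # best_score -> number of still-losers

options = {
    "A": CURRENT + LOSERS[5],             # hand-fix the score=5 rows only
    "B": CURRENT + sum(LOSERS.values()),  # hand-curate all 24 losers
    "C": CURRENT,                         # no further curation
}

for name, winners in options.items():
    status = "meets" if winners >= GATE else "misses"
    print(f"{name}: {winners}/100 {status} the {GATE}/100 gate")
```

Only option B clears the gate as stated; option A would need 4 more rows fixed beyond the score=5 set to reach 90.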

## Recommendation

**B if you want the strict 90+ gate; A if mid-effort; C for fastest
path to first surgery iteration.** Awaiting GO.
