BD7 — teacher retry v2 (2026-05-02)

TL;DR: retrying the 30 v1 losers with 3 prompt variants each (90 prompts) recovered 6, raising total winners from 70/100 to 76/100. 24 still fail: 13 at score=0 (no JSON from any variant), 10 at score=5 (one field short, easy hand-fix), 1 at score=3. A spot-check of 10 random v1 winners shows a solid schema and mostly defensible content (8/10 strong, 2/10 weak).

Numbers

| stage                 | wall time | new winners | cumulative |
|-----------------------|-----------|-------------|------------|
| forge v1              | 102 min   | 70/100      | 70/100     |
| retry v2 (3 variants) | 54 min    | +6 / 30     | 76/100     |

The strict-JSON gate from the spec was 90/100. We are at 76/100, 14 short. Reaching 90+ via the teacher alone would need either many more retry rounds (about an hour each) or hand-fixes.
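As an illustration of what a strict-JSON check can look like, here is a minimal sketch. It assumes "strict" means the entire model output must parse as one JSON object with no surrounding prose; the spec's exact rule is not quoted in this report, and `parses_strictly` is a hypothetical helper, not a function from the forge scripts.

```python
import json


def parses_strictly(raw: str) -> bool:
    """Return True only if the whole output is a single JSON object.

    Assumption: leading or trailing prose (e.g. "Sure! {...}") counts as a failure.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # A bare list, string, or number is valid JSON but not the expected object.
    return isinstance(obj, dict)


if __name__ == "__main__":
    print(parses_strictly('{"tc": "x"}'))
    print(parses_strictly('Sure! {"tc": "x"}'))
```

Under this reading, a score=0 row is one where no variant produced output that passes such a check.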

24 still-loser breakdown

| best_score | count | tasks | next-step cost |
|------------|-------|-------|----------------|
| 5 / 6 | 10 | ARIZ/02, /06, /32, /34, /68, /72, /81, /86, /93, /94 | hand-fill 1 field per row, ~15 min |
| 3 / 6 | 1 | ARIZ/43 | hand-fill 3 fields, ~3 min |
| 0 / 6 | 13 | ARIZ/01, /03, /07, /18, /30, /41, /48, /61, /67, /77, /79, /88, /99 | full hand-write, ~30-40 min |

The 10 score=5 rows are the cheapest path: each v2 retry candidate already has 5 of 6 fields, so only the missing one needs to be added.
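To locate the one field a score=5 candidate lacks, something like the sketch below could work. The field names are hypothetical: the report only says the schema has TC/PC/IFR plus three arrays, so `tc`, `pc`, `ifr`, `moves`, `operators`, and `resources` are placeholders, and the scoring rule (one point per non-empty field) is an assumption about how best_score is computed.

```python
from typing import Any

# Hypothetical field names; replace with the real schema keys.
STRING_FIELDS = ("tc", "pc", "ifr")
ARRAY_FIELDS = ("moves", "operators", "resources")
REQUIRED = STRING_FIELDS + ARRAY_FIELDS


def is_filled(name: str, value: Any) -> bool:
    """A string field needs non-blank text; an array field needs at least one item."""
    if name in STRING_FIELDS:
        return isinstance(value, str) and value.strip() != ""
    return isinstance(value, list) and len(value) > 0


def missing_fields(candidate: dict[str, Any]) -> list[str]:
    """List the required fields that are absent or empty, in schema order."""
    return [name for name in REQUIRED if not is_filled(name, candidate.get(name))]


def score(candidate: dict[str, Any]) -> int:
    """Assumed scoring: one point per filled required field (0 to 6)."""
    return len(REQUIRED) - len(missing_fields(candidate))


if __name__ == "__main__":
    example = {
        "tc": "improving flow worsens noise",
        "pc": "the pipe must be wide and narrow",
        "moves": ["split the flow"],
        "operators": ["1 Segmentation"],
        "resources": ["existing housing"],
    }
    print(score(example), missing_fields(example))
```

Running this over the still-loser candidates would give a per-row to-do list for the hand-fix pass.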

Spot-check 10 random v1 winners

Random sample (seed=42): ARIZ/22, /10, /49, /44, /39, /25, /21, /57, /64, /82.

Strong (8/10):

ARIZ/22 X-ray, ARIZ/10 Li-ion, ARIZ/49 loudspeaker, ARIZ/39 glass beaker, ARIZ/25 ship hull, ARIZ/21 solar panel, ARIZ/57 emergency siren, ARIZ/64 solar oven: clear TC/PC/IFR, concrete moves, plausible operators.

Weak (2/10):

ARIZ/44 water pump: TC direction inverted ("speed decreases" while we want both flow AND speed up). PC inverted too.

ARIZ/82 satellite array: IFR is generic fluff ("...without compromising either function"). Other 5 fields fine.

Operator-name slip: sometimes the 7B writes the wrong TRIZ-40 number for the right name (e.g. "46 Composite materials", where the actual TRIZ-40 operator is 40 "Composite materials" and 46 falls outside the 1-40 range). Mild concern: the 0.5B might learn the wrong number. But the schema STRUCTURE is intact, which is what surgery is teaching.
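A slip like this could be flagged automatically. The sketch below assumes operator labels have the form "<number> <name>" as in the example above. `KNOWN_NAMES` is deliberately partial (only the entry cited in this report) and `check_operator` is a hypothetical helper; a real check would need the full 40-entry table.

```python
import re

# Partial table for illustration only; extend with the full TRIZ-40 list.
KNOWN_NAMES = {40: "Composite materials"}

LABEL_RE = re.compile(r"\s*(\d+)\s+(.+?)\s*$")


def check_operator(label: str) -> str:
    """Classify an operator label as ok, unparsed, or one of three error kinds."""
    match = LABEL_RE.match(label)
    if match is None:
        return "unparsed"
    number, name = int(match.group(1)), match.group(2)
    if not 1 <= number <= 40:
        return "number out of range"
    expected = KNOWN_NAMES.get(number)
    if expected is not None and expected.lower() != name.lower():
        return "name mismatch"
    # The name is known, but it is listed under a different number.
    for known_number, known_name in KNOWN_NAMES.items():
        if known_name.lower() == name.lower() and known_number != number:
            return "wrong number for name"
    return "ok"


if __name__ == "__main__":
    for label in ("46 Composite materials", "40 Composite materials"):
        print(label, "->", check_operator(label))
```

Flagged rows could then be corrected before training, which would remove the risk of the 0.5B memorising a wrong number.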

Verdict: content quality is good enough for surgery to learn the JSON schema and engineering-shape vocabulary. The 0.5B will not learn TRIZ-40 perfectly anyway — what matters is that it learns to emit 6-field JSON with TC/PC/IFR + 3 arrays of plausible engineering items.

Files written

data/organ_surgery/phys05_triz_contradiction/teacher_targets_v2.jsonl (76 winners, ordered by task_id)

data/organ_surgery/phys05_triz_contradiction/teacher_targets_v2_losers.json (24 still-losers with best_score / variant)

data/organ_surgery/phys05_triz_contradiction/teacher_candidates_raw.json (200 v1 candidates, full set for reference)

tools/surgery/build_triz_teacher_targets.py — v1 forge
tools/surgery/build_triz_teacher_retry.py — v2 retry, 3-variant
reports/BD7_TEACHER_RETRY_V2.md — this file

Three viable next steps

A: Hand-fix the 10 score=5 losers (~15 min). Brings to 86/100.
B: Full hand-curate of all 24 losers (~45-50 min). Brings to 100/100.
C: Proceed to QLoRA with 76 (no more curation). Smaller but workable training set. Eval on 100 will probably show the LoRA handles the 76 trained domains well and stumbles on the 24 untrained.
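The arithmetic behind the three options can be checked with a short sketch. The counts come from the tables in this report; the names `OPTIONS` and `project` are illustrative only.

```python
GATE = 90             # strict-JSON gate from the spec, out of 100 tasks
CURRENT_WINNERS = 76  # after retry v2

# Extra winners each option would add, per the still-loser breakdown.
OPTIONS = {
    "A": 10,  # hand-fix the ten score=5 rows
    "B": 24,  # hand-curate all remaining losers
    "C": 0,   # no further curation
}


def project(extra: int) -> tuple[int, bool]:
    """Return (projected winner count, whether it meets the gate)."""
    total = CURRENT_WINNERS + extra
    return total, total >= GATE


if __name__ == "__main__":
    for name, extra in OPTIONS.items():
        total, ok = project(extra)
        verdict = "meets gate" if ok else "below gate"
        print(f"{name}: {total}/100, {verdict}")
```

Of the three, only B clears the 90/100 gate; A ends at 86/100 and C stays at 76/100.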

Recommendation

B if you want the strict 90+ gate (A alone stops at 86/100, still below it); A for a mid-effort compromise; C for the fastest path to a first surgery iteration. Awaiting GO.