# TASK 3 — official benches status (2026-05-01)

Per user TASK 3: "Run LiveCodeBench subset, BFCL official/hard subset,
GPQA Diamond subset. Use official harness/dataset where possible.
If blocked, blocker must be exact: auth/dataset/dependency/GPU/harness."

## Per-bench status

### LiveCodeBench &nbsp;✅ HARNESS READY, BLOCKED ON GPU

* **Dataset:** `livecodebench/code_generation` (HF, public, no auth required).
* **Schema verified:** `question_content` (prompt), `public_test_cases` (stdin/stdout JSON list), `difficulty` (easy/medium/hard).
* **Harness:** `tools/bench/livecodebench_3mode.py` — same A/B/C modes as MBPP harness, scores by running candidate Python program against every public test case after `python3 -c <prog>` with stdin from each `tc['input']`. Default `--difficulty easy --n 50`.
* **Note on `code_generation_lite`:** that variant uses a deprecated dataset script (`Dataset scripts are no longer supported`); the non-lite variant works without modification.
* **Blocker:** **GPU contention** — TASK 2 full N=100/164 × 3 modes is occupying the single RTX 3060 Ti for ~6 hours. LiveCodeBench can fire after.

### BFCL official hard subset &nbsp;❌ BLOCKED — DEPENDENCY

* **Tried dataset:** `gorilla-llm/Berkeley-Function-Calling-Leaderboard`.
* **Exact error:**

  ```
  DataFilesNotFoundError: No (supported) data files found in
  gorilla-llm/Berkeley-Function-Calling-Leaderboard
  ```

  The official BFCL data is **not packaged as a HuggingFace dataset.** The official path requires their own `gorilla-bfcl` Python package (`pip install bfcl-eval`) and their CLI runner (`bfcl run --model <name> --test-category multi_turn_long_context,...`). That CLI is built around OpenAI/Anthropic API hosts; pointing it at a local `--chat`-shaped binary needs custom glue.
* **Blocker = dependency.** Requires installing `bfcl-eval` and writing an adapter that wraps `./build/gigachad_native --chat <prompt>` as a fake API endpoint the BFCL CLI can hit. ~3–4 hours additional engineering.
* **Existing `tools/bench/bfcl_subset.py`:** that's the hand-curated 10-problem smoke that TASK 5 explicitly disqualifies. Kept as a smoke; **not** a substitute for the official BFCL.

### GPQA Diamond &nbsp;❌ BLOCKED — AUTH

* **Tried dataset:** `Idavidrein/gpqa`, config `gpqa_diamond`.
* **Exact error:**

  ```
  DatasetNotFoundError: Dataset 'Idavidrein/gpqa' is a gated dataset on the Hub.
  You must be authenticated to access it.
  ```

* **Blocker = auth.** Requires `huggingface-cli login` with a token that has accepted the gated-dataset terms on `https://huggingface.co/datasets/Idavidrein/gpqa`. Then `HF_TOKEN=...` env var lets the dataset load.
* **No fake substitute** — TASK 5 explicitly bans hand-made science-MCQ subsets.

## Summary

| bench           | dataset access | harness | runs after TASK 2 done? |
|-----------------|----------------|---------|--------------------------|
| LiveCodeBench   | ✅ public      | ✅ ready | YES — ungated            |
| BFCL official   | ❌ no HF dataset; needs `bfcl-eval` | not written | gated on dependency install + adapter glue |
| GPQA Diamond    | ❌ gated        | not written | gated on `HF_TOKEN` + terms acceptance |

## What runs as soon as TASK 2 finishes

```bash
cd /home/pc/gigachad_native
python3 tools/bench/livecodebench_3mode.py --n 50 --difficulty easy --modes A,B,C
# → reports/LIVECODEBENCH_3MODE_V1.{md,json}
```

To unblock the other two, the user must:

* For **BFCL**: `pip install bfcl-eval` → write a tiny FastAPI shim around `./build/gigachad_native --chat` that returns OpenAI-shape JSON, point `bfcl run --base-url http://localhost:NNNN/v1` at it. Then run mode B (`NO_7B_FALLBACK=1`) and mode C (default) by setting env on the shim.
* For **GPQA**: `huggingface-cli login`, accept `Idavidrein/gpqa` terms in browser, then a 1-pass MCQ harness can be written against `gpqa_diamond` config (~40 lines, similar to MBPP harness). 198 questions per Diamond split.

These blockers are **not** runtime issues. The TASK 1 plumbing (organ-first chain, `NO_7B_FALLBACK`) is already in place; the only thing missing is the data pipe from each official source into the same A/B/C harness shape.
