TASK 3 — official benches status (2026-05-01)

Per user TASK 3: "Run LiveCodeBench subset, BFCL official/hard subset, GPQA Diamond subset. Use official harness/dataset where possible. If blocked, blocker must be exact: auth/dataset/dependency/GPU/harness."

Per-bench status

LiveCodeBench  ✅ HARNESS READY, BLOCKED ON GPU

Dataset: livecodebench/code_generation (HF, public, no auth required).
Schema verified: question_content (prompt), public_test_cases (stdin/stdout JSON list), difficulty (easy/medium/hard).
Harness: tools/bench/livecodebench_3mode.py — same A/B/C modes as MBPP harness, scores by running candidate Python program against every public test case after python3 -c <prog> with stdin from each tc['input']. Default --difficulty easy --n 50.
Note on code_generation_lite: that variant uses a deprecated dataset script (Dataset scripts are no longer supported); the non-lite variant works without modification.
Blocker: GPU contention — TASK 2 full N=100/164 × 3 modes is occupying the single RTX 3060 Ti for ~6 hours. LiveCodeBench can fire after.

BFCL official hard subset  ❌ BLOCKED — DEPENDENCY

Tried dataset: gorilla-llm/Berkeley-Function-Calling-Leaderboard.
Exact error:

`` DataFilesNotFoundError: No (supported) data files found in gorilla-llm/Berkeley-Function-Calling-Leaderboard ``

The official BFCL data is not packaged as a HuggingFace dataset. The official path requires their own gorilla-bfcl Python package (pip install bfcl-eval) and their CLI runner (bfcl run --model <name> --test-category multi_turn_long_context,...). That CLI is built around OpenAI/Anthropic API hosts; pointing it at a local --chat-shaped binary needs custom glue.

Blocker = dependency. Requires installing bfcl-eval and writing an adapter that wraps ./build/gigachad_native --chat <prompt> as a fake API endpoint the BFCL CLI can hit. ~3–4 hours additional engineering.
Existing tools/bench/bfcl_subset.py: that's the hand-curated 10-problem smoke that TASK 5 explicitly disqualifies. Kept as a smoke; not a substitute for the official BFCL.

GPQA Diamond  ❌ BLOCKED — AUTH

Tried dataset: Idavidrein/gpqa, config gpqa_diamond.
Exact error:

`` DatasetNotFoundError: Dataset 'Idavidrein/gpqa' is a gated dataset on the Hub. You must be authenticated to access it. ``

Blocker = auth. Requires huggingface-cli login with a token that has accepted the gated-dataset terms on https://huggingface.co/datasets/Idavidrein/gpqa. Then HF_TOKEN=... env var lets the dataset load.
No fake substitute — TASK 5 explicitly bans hand-made science-MCQ subsets.

Summary

| bench | dataset access | harness | runs after TASK 2 done? | |-----------------|----------------|---------|--------------------------| | LiveCodeBench | ✅ public | ✅ ready | YES — ungated | | BFCL official | ❌ no HF dataset; needs bfcl-eval | not written | gated on dependency install + adapter glue | | GPQA Diamond | ❌ gated | not written | gated on HF_TOKEN + terms acceptance |

What runs as soon as TASK 2 finishes

cd ~/gigachad_native
python3 tools/bench/livecodebench_3mode.py --n 50 --difficulty easy --modes A,B,C
# → reports/LIVECODEBENCH_3MODE_V1.{md,json}

To unblock the other two, the user must:

For BFCL: pip install bfcl-eval → write a tiny FastAPI shim around ./build/gigachad_native --chat that returns OpenAI-shape JSON, point bfcl run --base-url http://localhost:NNNN/v1 at it. Then run mode B (NO_7B_FALLBACK=1) and mode C (default) by setting env on the shim.
For GPQA: huggingface-cli login, accept Idavidrein/gpqa terms in browser, then a 1-pass MCQ harness can be written against gpqa_diamond config (~40 lines, similar to MBPP harness). 198 questions per Diamond split.

These blockers are not runtime issues. The TASK 1 plumbing (organ-first chain, NO_7B_FALLBACK) is already in place; the only thing missing is the data pipe from each official source into the same A/B/C harness shape.

TASK 3 — official benches status (2026-05-01)

TASK 3 — official benches status (2026-05-01)

Per-bench status

LiveCodeBench &nbsp;✅ HARNESS READY, BLOCKED ON GPU

BFCL official hard subset &nbsp;❌ BLOCKED — DEPENDENCY

GPQA Diamond &nbsp;❌ BLOCKED — AUTH

Summary

What runs as soon as TASK 2 finishes

LiveCodeBench ✅ HARNESS READY, BLOCKED ON GPU

BFCL official hard subset ❌ BLOCKED — DEPENDENCY

GPQA Diamond ❌ BLOCKED — AUTH