TASK 3 — official benches status (2026-05-01)
Per user TASK 3: "Run LiveCodeBench subset, BFCL official/hard subset, GPQA Diamond subset. Use official harness/dataset where possible. If blocked, blocker must be exact: auth/dataset/dependency/GPU/harness."
Per-bench status
LiveCodeBench ✅ HARNESS READY, BLOCKED ON GPU
- Dataset:
livecodebench/code_generation(HF, public, no auth required). - Schema verified:
question_content(prompt),public_test_cases(stdin/stdout JSON list),difficulty(easy/medium/hard). - Harness:
tools/bench/livecodebench_3mode.py— same A/B/C modes as MBPP harness, scores by running candidate Python program against every public test case afterpython3 -c <prog>with stdin from eachtc['input']. Default--difficulty easy --n 50. - Note on
code_generation_lite: that variant uses a deprecated dataset script (Dataset scripts are no longer supported); the non-lite variant works without modification. - Blocker: GPU contention — TASK 2 full N=100/164 × 3 modes is occupying the single RTX 3060 Ti for ~6 hours. LiveCodeBench can fire after.
BFCL official hard subset ❌ BLOCKED — DEPENDENCY
- Tried dataset:
gorilla-llm/Berkeley-Function-Calling-Leaderboard. - Exact error:
`` DataFilesNotFoundError: No (supported) data files found in gorilla-llm/Berkeley-Function-Calling-Leaderboard ``
The official BFCL data is not packaged as a HuggingFace dataset. The official path requires their own gorilla-bfcl Python package (pip install bfcl-eval) and their CLI runner (bfcl run --model <name> --test-category multi_turn_long_context,...). That CLI is built around OpenAI/Anthropic API hosts; pointing it at a local --chat-shaped binary needs custom glue.
- Blocker = dependency. Requires installing
bfcl-evaland writing an adapter that wraps./build/gigachad_native --chat <prompt>as a fake API endpoint the BFCL CLI can hit. ~3–4 hours additional engineering. - Existing
tools/bench/bfcl_subset.py: that's the hand-curated 10-problem smoke that TASK 5 explicitly disqualifies. Kept as a smoke; not a substitute for the official BFCL.
GPQA Diamond ❌ BLOCKED — AUTH
- Tried dataset:
Idavidrein/gpqa, configgpqa_diamond. - Exact error:
`` DatasetNotFoundError: Dataset 'Idavidrein/gpqa' is a gated dataset on the Hub. You must be authenticated to access it. ``
- Blocker = auth. Requires
huggingface-cli loginwith a token that has accepted the gated-dataset terms onhttps://huggingface.co/datasets/Idavidrein/gpqa. ThenHF_TOKEN=...env var lets the dataset load. - No fake substitute — TASK 5 explicitly bans hand-made science-MCQ subsets.
Summary
| bench | dataset access | harness | runs after TASK 2 done? | |-----------------|----------------|---------|--------------------------| | LiveCodeBench | ✅ public | ✅ ready | YES — ungated | | BFCL official | ❌ no HF dataset; needs bfcl-eval | not written | gated on dependency install + adapter glue | | GPQA Diamond | ❌ gated | not written | gated on HF_TOKEN + terms acceptance |
What runs as soon as TASK 2 finishes
cd /home/pc/gigachad_native
python3 tools/bench/livecodebench_3mode.py --n 50 --difficulty easy --modes A,B,C
# → reports/LIVECODEBENCH_3MODE_V1.{md,json}
To unblock the other two, the user must:
- For BFCL:
pip install bfcl-eval→ write a tiny FastAPI shim around./build/gigachad_native --chatthat returns OpenAI-shape JSON, pointbfcl run --base-url http://localhost:NNNN/v1at it. Then run mode B (NO_7B_FALLBACK=1) and mode C (default) by setting env on the shim. - For GPQA:
huggingface-cli login, acceptIdavidrein/gpqaterms in browser, then a 1-pass MCQ harness can be written againstgpqa_diamondconfig (~40 lines, similar to MBPP harness). 198 questions per Diamond split.
These blockers are not runtime issues. The TASK 1 plumbing (organ-first chain, NO_7B_FALLBACK) is already in place; the only thing missing is the data pipe from each official source into the same A/B/C harness shape.