CyberdyneLabs · Reports · TASK3_OFFICIAL_BENCHES_STATUS

TASK 3 — official benches status (2026-05-01)

reports/TASK3_OFFICIAL_BENCHES_STATUS.md 560 words raw markdown ↗

TASK 3 — official benches status (2026-05-01)

Per user TASK 3: "Run LiveCodeBench subset, BFCL official/hard subset, GPQA Diamond subset. Use official harness/dataset where possible. If blocked, blocker must be exact: auth/dataset/dependency/GPU/harness."

Per-bench status

LiveCodeBench  ✅ HARNESS READY, BLOCKED ON GPU

BFCL official hard subset  ❌ BLOCKED — DEPENDENCY

`` DataFilesNotFoundError: No (supported) data files found in gorilla-llm/Berkeley-Function-Calling-Leaderboard ``

The official BFCL data is not packaged as a HuggingFace dataset. The official path requires their own gorilla-bfcl Python package (pip install bfcl-eval) and their CLI runner (bfcl run --model <name> --test-category multi_turn_long_context,...). That CLI is built around OpenAI/Anthropic API hosts; pointing it at a local --chat-shaped binary needs custom glue.

GPQA Diamond &nbsp;❌ BLOCKED — AUTH

`` DatasetNotFoundError: Dataset 'Idavidrein/gpqa' is a gated dataset on the Hub. You must be authenticated to access it. ``

Summary

| bench | dataset access | harness | runs after TASK 2 done? | |-----------------|----------------|---------|--------------------------| | LiveCodeBench | ✅ public | ✅ ready | YES — ungated | | BFCL official | ❌ no HF dataset; needs bfcl-eval | not written | gated on dependency install + adapter glue | | GPQA Diamond | ❌ gated | not written | gated on HF_TOKEN + terms acceptance |

What runs as soon as TASK 2 finishes

cd /home/pc/gigachad_native
python3 tools/bench/livecodebench_3mode.py --n 50 --difficulty easy --modes A,B,C
# → reports/LIVECODEBENCH_3MODE_V1.{md,json}

To unblock the other two, the user must:

These blockers are not runtime issues. The TASK 1 plumbing (organ-first chain, NO_7B_FALLBACK) is already in place; the only thing missing is the data pipe from each official source into the same A/B/C harness shape.