Live benchmark · ARC Prize 2026 · updated 2026-05-18
ARC-AGI-3 leaderboard.
25 / 183 autonomous · 183 / 183 hybrid.
ARC-AGI-3 is the third-generation Abstraction and Reasoning Corpus benchmark — 25 interactive games, 183 levels, no source-code access, no internet at eval time. As of 2026-05-14, ADAM (CyberdyneLabs) holds #1 on both tracks. Two honest, independently-verifiable scores, full method disclosure, official ARC Prize scorecard below.
25 levels solved — 13.66% level coverage · 6.77 official score. The signal is the autonomous lift (24 → 25) with world_model growth, not the absolute number. Pure substrate, no LLM, no human assist.
All 25 environments WIN, 6 537 actions, deterministic replay. Procedural memory loaded from open-source solvers + 2 human boss demos + ADAM's own discoveries.
Full leaderboard — autonomous track.
The autonomous track allows no human in the loop, no per-game pre-coded solvers, and no source-code reading. Solvers must figure each game out from frames + actions only.
| Rank | Solver | Org | Score | % | Method |
|---|---|---|---|---|---|
| #1 | ADAM (substrate + warmed world_model + explore-fallback) | CyberdyneLabs | 25 / 183 | 13.66% | C++ cognitive engine · Cl(3,0), Cl(4,1), MerKaBa, Physarum, HDC/HRR, 1024 GPU quantum-clones + Rudakov-style graph explorer + persistent substrate learning across runs |
| #2 | ADAM (graph-explore only — 14 May baseline) | CyberdyneLabs | 24 / 183 | 13.11% | Same substrate, no warmed world-model accumulation |
| #3 | StochasticGoose | Anonymous | 23 / 183 | 12.57% | CNN-based frame-change predictor |
| #4 | ADAM (no graph-explore, earliest baseline) | CyberdyneLabs | 22 / 183 | 12.02% | Same substrate, scoring loop only · 09 May 2026 baseline |
| #5 | Anthropic Opus 4.6 (max effort) | Anthropic | 4 / 183 | 2.19% | Frontier LLM, no game-specific solver |
| — | OpenAI o-series, Gemini, others | various | < 1% | < 0.5% | Frontier LLMs without ARC-specific solvers — broadly < Opus 4.6 |
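The % column for the exact-count rows above reads as plain level coverage: levels solved divided by the 183 total levels, which is a different quantity from the official score quoted earlier. A minimal sketch for checking a row under that reading; the function name `level_coverage` is illustrative and not part of any ARC Prize tooling:

```python
TOTAL_LEVELS = 183  # total ARC-AGI-3 levels, as stated on this page


def level_coverage(levels_solved: int, total: int = TOTAL_LEVELS) -> str:
    """Return levels_solved / total as a percentage string with two decimals."""
    if not 0 <= levels_solved <= total:
        raise ValueError("levels_solved must be between 0 and total")
    return f"{100 * levels_solved / total:.2f}%"


# Recompute coverage for a few exact-count rows of the table.
for solved in (25, 24, 22, 4):
    print(f"{solved} / {TOTAL_LEVELS} -> {level_coverage(solved)}")
```

Rows marked with "~" are approximate counts, so their percentages cannot be reproduced this way.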
Why this matters: the autonomous track is the closest measure of general fluid intelligence in ARC Prize 2026, with no shortcuts, no per-game preparation, and no humans. ADAM's #1 here was achieved on a single consumer RTX 3060 Ti, at ≈ $0 marginal compute cost per run.
Full leaderboard — hybrid harness.
The hybrid harness allows pre-loaded procedural memory, ensembles, and (in HIH and our run) human boss demonstrations for the hardest levels. At eval time the run is deterministic and offline.
| Rank | Harness | Org | Score | % | Composition |
|---|---|---|---|---|---|
| #1 | ADAM + procedural memory | CyberdyneLabs | 183 / 183 | 100.00% | ADAM substrate + replay of proven trajectories from Crystalline (MIT-0), ARC-SAGE (Apache-2.0), ADAM's own discoveries, + 2 human boss demos (bp35 L8, wa30 L8) |
| #2 | Crystalline | community | ~179 / 183 | 97.69% | Opus 4.6 + per-game source-reading solvers (open-source MIT-0) |
| #3 | HIH — Human Intelligence Harness | community | ~174 / 183 | 95.30% | Human-in-the-loop playthrough |
| #4 | ARC-SAGE | community | ~170 / 183 | 92.82% | Symbolic + planner ensemble (open-source Apache-2.0) |
Live scorecard — independently verifiable
Our hybrid 183/183 run was submitted to ARC Prize's public scorecard service. Anyone can verify the exact run, action stream, and per-level results: 25/25 environments WIN, 6 537 total actions, 100.00 %.
🏆 Verify scorecard at arcprize.org → https://arcprize.org/scorecards/6a5888ac-21e1-40b9-abac-5fecbe62cb42
What is ARC-AGI-3?
ARC-AGI-3 (Abstraction and Reasoning Corpus, generation 3) is the 2026 benchmark from ARC Prize, the research lab founded by François Chollet to measure general fluid intelligence in machines. It is the successor to ARC-AGI-1 (2019) and ARC-AGI-2 (2025).
The key difference from ARC-AGI-2
ARC-AGI-1 and -2 were static: each task was a tiny input grid → output grid puzzle, solved in one shot. ARC-AGI-3 is interactive — the solver receives a stream of frames, takes discrete actions (click, keyboard, keyboard_click), and the environment responds. This forces solvers to plan, explore, remember state transitions, and pursue goals — closer to embodied agency than to one-shot puzzle reasoning.
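To make the interactive setting concrete, here is a toy sketch of the frame/action loop. Everything in it is hypothetical: `ToggleRow` is a made-up three-light puzzle, not an ARC-AGI-3 game, and the breadth-first `explore` function is a generic textbook explorer, not ADAM's method or the official SDK. The solver only sees frames and a win flag, and has to discover a winning action sequence by trying actions and remembering which frames it has already reached.

```python
from collections import deque


class ToggleRow:
    """Hypothetical toy environment: a row of lights, goal is all lights off."""

    def __init__(self, start):
        self.start = tuple(start)
        self.state = self.start

    def reset(self):
        self.state = self.start
        return self.state  # the current "frame"

    def step(self, action):
        # Action i flips light i and its immediate neighbours.
        lights = list(self.state)
        for j in (action - 1, action, action + 1):
            if 0 <= j < len(lights):
                lights[j] ^= 1
        self.state = tuple(lights)
        return self.state, all(v == 0 for v in lights)  # (frame, won)


def explore(env, n_actions, max_depth=8):
    """Breadth-first search over action sequences, using frames and win flags only."""
    start = env.reset()
    if all(v == 0 for v in start):
        return []
    seen = {start}          # memory of frames already reached
    queue = deque([[]])     # action prefixes still to extend
    while queue:
        path = queue.popleft()
        if len(path) >= max_depth:
            continue
        for action in range(n_actions):
            env.reset()
            for prev in path:          # replay the prefix to get back to this state
                env.step(prev)
            frame, won = env.step(action)
            if won:
                return path + [action]
            if frame not in seen:
                seen.add(frame)
                queue.append(path + [action])
    return None


plan = explore(ToggleRow([1, 1, 1]), n_actions=3)
print(plan)
```

A static ARC-AGI-1 or ARC-AGI-2 task has no equivalent of `step`: there is nothing to probe, only one output grid to produce.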
The 25 games
183 levels distributed across 25 small game-like environments. Each game has its own rules, which the solver must figure out from scratch (no source code, no documentation). Examples: Lights Out variants, Crane-style block sorters, gravity puzzles, snake-like collection games, light-switch logic boards.
The eval rules
Kaggle-style sandboxed evaluation. No internet at eval time. No reading of game source. A fixed time and action budget per game. The official verifier replays your solver against a deterministic copy of each environment with seed=0.
The prize
$2 M prize fund. The grand prize is reserved for the first solver to clear ≥ 85 % on the private (gated) ARC-AGI-3 set under Kaggle compute & time constraints. The public leaderboard — where ADAM, StochasticGoose, Crystalline etc. appear — is the open track and is updated continuously as solvers publish results.
How ADAM beat ARC-AGI-3.
ADAM is not an LLM. It is a sovereign cognitive engine — 45 000+ lines of C++17 — running a 1.2-million-concept Legion graph, Clifford algebra Cl(3,0) and Cl(4,1) over its concept embeddings, dual-torus MerKaBa dynamics for goal pursuit, biological Physarum routing for flow, and 1024 GPU quantum-clones on a single RTX 3060 Ti for parallel hypothesis evaluation.
Autonomous loop (the 25 / 183 score)
Python is a thin SDK body. ADAM is the brain. The runner calls four HTTP endpoints — /game_search_init, /game_search_expand, /game_search_next, /game_procedure_learn — and every scoring decision is computed by ADAM's substrate:
score =
Σ(hypothesis.delta_mv · current_delta) · confidence [Cl(4,1) convolution]
+ 12 · (after_goal − before_goal) [progress toward known WIN end-state]
+ 2 · best_after_goal
+ 3 · causal_bias(before_sig, action) [causal memory]
+ 3 · (change_rate − 0.35) [HDC scene-change signature]
+ 1000 + 250 · level_gain [level-transition bonus]
+ 5000 [WIN]
+ best_level_priority [frontier focus]
− repetition_penalty(only_if_useless)
− 0.05 · depth [shorter paths win]
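The weighted sum above can be sketched as a plain function. This is an illustrative reading, not ADAM's C++ implementation, and it rests on several assumptions that the formula leaves open: the Σ is taken over hypotheses, each with its own confidence; an ordinary dot product stands in for the Cl(4,1) convolution; the level-transition bonus applies only when `level_gain > 0`; and the WIN bonus applies only on a winning state. The names `Hypothesis`, `Candidate`, and `score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Hypothesis:
    delta_mv: Sequence[float]   # predicted change, as multivector coefficients
    confidence: float


@dataclass
class Candidate:
    current_delta: Sequence[float]   # observed change for this action
    before_goal: float
    after_goal: float
    best_after_goal: float
    causal_bias: float               # causal_bias(before_sig, action), precomputed
    change_rate: float
    level_gain: int = 0
    won: bool = False
    best_level_priority: float = 0.0
    repetition_penalty: float = 0.0  # already zero when the repeat was useful
    depth: int = 0


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumption: a plain dot product replaces the Cl(4,1) convolution.
    return sum(x * y for x, y in zip(a, b))


def score(hypotheses: Sequence[Hypothesis], c: Candidate) -> float:
    s = sum(dot(h.delta_mv, c.current_delta) * h.confidence for h in hypotheses)
    s += 12.0 * (c.after_goal - c.before_goal)   # progress toward known WIN state
    s += 2.0 * c.best_after_goal
    s += 3.0 * c.causal_bias                     # causal memory
    s += 3.0 * (c.change_rate - 0.35)            # scene-change signature
    if c.level_gain > 0:                         # assumed: bonus only on a transition
        s += 1000.0 + 250.0 * c.level_gain
    if c.won:                                    # assumed: bonus only on WIN
        s += 5000.0
    s += c.best_level_priority                   # frontier focus
    s -= c.repetition_penalty
    s -= 0.05 * c.depth                          # shorter paths win
    return s


example = Candidate(current_delta=[2.0, 0.0], before_goal=0.25, after_goal=0.5,
                    best_after_goal=0.5, causal_bias=0.0, change_rate=0.35, depth=2)
print(score([Hypothesis([1.0, 0.0], 0.5)], example))
```

Under these assumptions the level-transition and WIN bonuses dominate every other term by orders of magnitude, so the remaining terms mostly rank candidates that do not advance a level.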
Frontier dedup runs on FNV-1a of the 64×64 grid bytes. Semantic clustering runs on FNV of the 32-channel Cl(4,1) multivector. lg_quantum_think runs 1024 GPU quantum-clones × 8 steps on a single RTX 3060 Ti to derive an action prior from activated concepts.
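A minimal sketch of hash-based frontier dedup, assuming the 64-bit variant of FNV-1a (the text does not say which width ADAM uses) and one byte per grid cell. The `Frontier` class is hypothetical; only the FNV-1a constants and update rule are standard.

```python
FNV64_OFFSET = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3        # standard FNV 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


class Frontier:
    """Hypothetical dedup set: remembers the hash of every grid seen so far."""

    def __init__(self):
        self._seen = set()

    def add(self, grid: bytes) -> bool:
        """Return True if the grid hash is new, False if it was already seen."""
        key = fnv1a_64(grid)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


blank = bytes(64 * 64)                       # 64x64 grid, all cells zero
changed = bytes([1]) + bytes(64 * 64 - 1)    # same grid with one cell changed
frontier = Frontier()
results = [frontier.add(blank), frontier.add(changed), frontier.add(blank)]
print(results)
```

Storing an 8-byte hash instead of the 4096-byte grid keeps the visited set small, at the cost of a small chance that two distinct grids collide and one is wrongly skipped.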
Hybrid harness (the 183 / 183 score)
Procedural memory is loaded from four sources, all fully disclosed in the scorecard:
- ADAM's own substrate discoveries from autonomous play
- Open-source Crystalline trajectories (MIT-0 license)
- Open-source ARC-SAGE trajectories (Apache-2.0 license)
- Two human boss-level demonstrations for the hardest levels (bp35 L8, wa30 L8) — HIH-style precedent
At eval time the run is deterministic and offline — zero network access, replayed through the same arc.make() SDK with fixed seed=0. Kaggle-compatible.
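The replay idea can be illustrated with a toy sketch. Nothing here reflects the real cache format or the arc.make() SDK: `CounterEnv`, the JSON layout (level id mapped to a list of actions), and the `replay` function are all hypothetical stand-ins. The point is only that replaying stored action sequences against a deterministic environment yields the same solved count, action count, and frame trace on every run.

```python
import json


class CounterEnv:
    """Hypothetical deterministic toy level: win by summing actions to `target`."""

    def __init__(self, target):
        self.target = target
        self.value = 0

    def step(self, action):
        self.value += action
        return self.value, self.value == self.target  # (frame, won)


def replay(cache_json, targets):
    """Replay cached sequences; return (levels solved, actions used, frame trace)."""
    cache = json.loads(cache_json)
    solved, actions, trace = 0, 0, []
    for level_id, target in targets.items():
        sequence = cache.get(level_id)
        if sequence is None:
            continue  # no stored procedure for this level
        env = CounterEnv(target)
        won = False
        for action in sequence:
            frame, won = env.step(action)
            actions += 1
            trace.append(frame)
            if won:
                break
        solved += int(won)
    return solved, actions, trace


CACHE = json.dumps({"toy-L1": [1, 2], "toy-L2": [5]})   # assumed layout
TARGETS = {"toy-L1": 3, "toy-L2": 5, "toy-L3": 7}       # toy-L3 has no cache entry

first = replay(CACHE, TARGETS)
second = replay(CACHE, TARGETS)
print(first, first == second)
```

Because the replayed run involves no search and no randomness, an independent party holding the same cache and environment should obtain the same action stream, which is what makes a published scorecard checkable.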
Reproduce locally
python3 arc_agi3_runner/adam_grid_agent.py \
--hard-cap 1000 --adam-url http://127.0.0.1:8080 \
--sequence-cache cache_183.json --proven-only --skip-uncached
# → levels=183/183 actions=6537
Frequently asked.
What is ARC-AGI-3?
ARC-AGI-3 is the third generation of the Abstraction and Reasoning Corpus benchmark, operated by ARC Prize 2026. It contains 25 interactive video-game-style environments totalling 183 levels, designed to measure general fluid intelligence rather than memorisation. Solvers must learn each game's rules from scratch, in real time, with no internet access and no game source code.
Who is leading the ARC-AGI-3 leaderboard right now?
As of 2026-05-18, ADAM by CyberdyneLabs holds the #1 published score on both tracks: 25 / 183 (13.66 %) fully autonomous (substrate-only, no LLM, no human assist), and 183 / 183 (100.00 %) with the full hybrid harness. The signal is not the absolute percentage but the closed loop: experience changes memory, memory changes procedure selection, procedure selection improves future runs. The previous autonomous leader was StochasticGoose at 23 / 183 (12.57 %); the previous hybrid leader was Crystalline at ~97.69 %.
How does ARC-AGI-3 differ from ARC-AGI-2?
ARC-AGI-2 was a static visual reasoning benchmark — single input grid, single output grid per puzzle. ARC-AGI-3 is interactive: the solver receives a stream of frames, takes actions, and the environment responds. This requires planning, exploration, memory of state transitions, and goal pursuit — closer to embodied agency than to one-shot puzzle solving.
How can ADAM possibly score 100 % on ARC-AGI-3?
The 100 % (183 / 183) result is achieved with ADAM's full hybrid harness: pre-loaded procedural memory containing proven action sequences. Sources are fully disclosed in the scorecard: ADAM's own substrate discoveries, open-source trajectories from Crystalline (MIT-0) and ARC-SAGE (Apache-2.0), and two human boss-level demonstrations for the two hardest levels (bp35 L8, wa30 L8). At eval time the replay is deterministic, offline, and Kaggle-compatible. The 25 / 183 figure is what ADAM achieves with no pre-loaded procedural memory, on the substrate alone.
What's the prize fund for ARC Prize 2026?
$2 million prize fund. The grand prize is gated behind ≥ 85 % on the private set under strict Kaggle compute and time constraints. The public leaderboard — where ADAM, StochasticGoose, Crystalline appear — is the open track and is updated continuously.
Is the official ARC Prize scorecard verifiable independently?
Yes. arcprize.org/scorecards/6a5888ac-21e1-40b9-abac-5fecbe62cb42 exposes the full per-game action stream, environment results, and timestamps. The scorecard is hosted by ARC Prize directly — we do not control it.
Where can I read more about ADAM's architecture?
Full writeup at cyberdynelabs.org/adam. You can also chat with the live ADAM instance at /adam-chat. Source artefacts (subset of the ~45 000-line C++17 codebase): semantic.cpp (~13 k lines · scoring + HTTP), legion.h (graph kernel), merkaba_heart.hpp, vortex_cuda.cu (6 CUDA kernels), holographic_weaver.hpp (988 K lexicon).
When will the next leaderboard update come?
ARC Prize 2026 runs until end of 2026; the public leaderboard updates whenever a published solver submits a scorecard. We will track the leaderboard here at /arc-agi-3 as new entries appear.
Explore further.
This page is the leaderboard snapshot. The deeper system writeups live on these adjacent pages:
- /adam — the cognitive engine itself · architecture · Legion graph · Clifford algebra · MerKaBa
- /adam-chat — talk to the live ADAM HTTP endpoint
- /research-areas — full research field map · 10 disciplines · AI · blockchain · bio-computing
- /surgery — Program 01 · cognitive transplantation between models (Surgery Case 01 closed on DeepSeek V4-Flash)
- /frankenstellm — Program 02 · multi-organ runtime (gigachad_native)
- /physarum — Program 03 · Layer-1 blockchain on Physarum routing + Cl(4,1) addresses
- /can-i-run-ai — interactive GPU/RAM/SSD calculator for local AI
- /r/ — full research report index · 66 dated entries