Self-evaluation report · scorecard dated 2026-05-18

ADAM on ARC-AGI-3: self-evaluation and world-model learning report.

ARC-style interactive tasks are useful because they exercise world model, exploration, action, memory, and adaptation in one loop. We use them as an engineering gate for ADAM — not as a claim that AGI is solved.

Claim discipline. This is a non-official self-evaluation scorecard. It is not a Kaggle submission and not a leaderboard rank claim. We publish it because the cognitive-loop signal (world model + action + memory) is useful regardless of any official ranking.

Self-eval score
6.77%

Self-evaluation scorecard dated 2026-05-18. Substrate-only ADAM, no LLM, no human assist.

Levels reached
25 / 183

Levels completed across the 25 interactive environments under the substrate-only configuration.

Actions used
2,673

Total actions across the run. Used internally as a baseline for action efficiency targets.

Benchmark · ARC-AGI-3
Environments · 25
Total levels · 183
Internet access · NO
Source-code access · NO
Status · Self-evaluation

Why this matters, and what we are not claiming.

The useful signal is not a rank. The useful signal is that ADAM was put into interactive tasks where the same engine has to: build a model of the environment from scratch, decide what to try next, remember what already happened, and pursue goals over time. That is exactly the cognitive loop the engine is being engineered to close.

What we claim. ADAM reaches 25 of 183 levels and records a self-evaluation score of 6.77%, using 2,673 actions on the public ARC-AGI-3 environments under a substrate-only configuration. The scorecard is dated 2026-05-18, and the run can be reproduced with the command under "Reproduce locally".

What we do not claim.

  • Not a Kaggle submission. Not a leaderboard rank claim. Not an "official ARC Prize" position.
  • Not a claim that ADAM is AGI or ASI. ADAM is on an AGI track and is not AGI today.
  • Not a "general intelligence solved" claim. ARC-AGI-3 is one behavioural gate, not the final answer.
  • Not a marketing "#1" line. The percentage is small and the work to close the loop is ongoing.

Reference context — other reported numbers in the space.

This is not a rank table. It is a context block so a reader has rough numbers to compare against. Other entries are reproduced from publicly disclosed reports / open-source repos; we do not control those sources.

Solver | Configuration | Reported levels | Reported %
ADAM (this report) | Substrate-only, no LLM, no human, self-evaluation | 25 / 183 | 6.77%
StochasticGoose | CNN frame-change predictor (community report) | ~23 / 183 | ~12.58%
Frontier LLM (Opus-class, max effort) | No game-specific solver (community report) | ~4 / 183 | ~2.19%
Frontier LLMs (general, no ARC solver) | Out-of-the-box, no scaffolding | < 1 / 183 | < 0.5%

Note: percentage by levels here is a coarse comparison axis. Different solvers also differ in actions-per-level efficiency, configuration restrictions, and what counts as "with/without LLM". Treat as orientation, not as a ranking.
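As a rough illustration of the two axes mentioned in the note, the sketch below computes a plain levels-out-of-183 percentage and an average actions-per-level figure from numbers quoted on this page. The function names are ours and purely illustrative. The plain ratio for 25 / 183 comes out near 13.66%, so the 6.77% self-eval score presumably follows a different scoring rule that this page does not spell out; the sketch does not attempt to reproduce it.

```python
# Plain level ratio and actions-per-level, using only numbers quoted on this page.
# Function names are illustrative; this is not the scorecard's scoring rule.
TOTAL_LEVELS = 183


def level_pct(levels: int, total: int = TOTAL_LEVELS) -> float:
    """Levels completed as a percentage of all levels, rounded to 2 decimals."""
    return round(100.0 * levels / total, 2)


def actions_per_level(actions: int, levels: int) -> float:
    """Average number of actions spent per completed level."""
    return round(actions / levels, 2)


rows = {
    "StochasticGoose": 23,
    "Frontier LLM (Opus-class)": 4,
    "ADAM (this report)": 25,
}
for name, levels in rows.items():
    print(f"{name}: {levels}/{TOTAL_LEVELS} = {level_pct(levels)}%")
print("ADAM actions per level:", actions_per_level(2673, 25))
```

On the figures above this gives 12.57%, 2.19%, and 13.66% for the three rows, and about 106.92 actions per completed level for ADAM.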

What is ARC-AGI-3?

ARC-AGI-3 (Abstraction and Reasoning Corpus, generation 3) is the 2026 interactive benchmark from ARC Prize, the nonprofit co-founded by François Chollet and Mike Knoop. It is the successor to ARC-AGI-1 (2019) and ARC-AGI-2 (2025).

Key difference from ARC-AGI-2

ARC-AGI-1 and -2 were static: each task was an input grid → output grid puzzle, solved in one shot. ARC-AGI-3 is interactive — the solver receives a stream of frames, takes discrete actions (click, keyboard, keyboard_click), and the environment responds. This forces planning, exploration, memory of state transitions, and goal pursuit — closer to embodied agency than to one-shot puzzle reasoning.

Why interactive tasks are useful for ADAM

ADAM is being engineered to close a cognitive loop: perceive → world model → memory/belief → reasoning paths → action/planner → observe outcome → verify/correct → self-curriculum → updated memory. Interactive ARC tasks pressure the entire loop in a single run, which is exactly the behavioural gate we want.
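To make the loop shape concrete, here is a minimal toy sketch, not ADAM's implementation. It assumes a hypothetical one-dimensional corridor environment (`ToyEnv`) and a tabular transition memory; every name in it is invented for illustration. It covers only the perceive, world-model, memory, planner, and observe steps, and leaves out verification, self-correction, and self-curriculum.

```python
from collections import deque


class ToyEnv:
    """Hypothetical 1-D corridor: the frame is a cell index, the goal is the last cell."""
    ACTIONS = ("left", "right", "noop")

    def __init__(self, size=5):
        self.size, self.pos = size, 0

    def frame(self):
        return self.pos

    def step(self, action):
        if action == "right":
            self.pos = min(self.size - 1, self.pos + 1)
        elif action == "left":
            self.pos = max(0, self.pos - 1)
        return self.pos, self.pos == self.size - 1  # (next frame, level solved?)


def plan_to_frontier(model, start, actions):
    """BFS over remembered transitions. Returns the first action on a shortest
    path to a state that still has an untried action, or None if none is reachable."""
    queue, seen = deque([(start, None)]), {start}
    while queue:
        state, first = queue.popleft()
        if first is not None and any((state, a) not in model for a in actions):
            return first
        for a in actions:
            nxt = model.get((state, a))
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, first if first is not None else a))
    return None


def run_episode(env, max_actions=50):
    model = {}                               # memory: (state, action) -> next state
    state, used, done = env.frame(), 0, False  # perceive the first frame
    while not done and used < max_actions:
        untried = [a for a in env.ACTIONS if (state, a) not in model]
        # Explore locally if possible, otherwise plan through the learned model.
        action = untried[0] if untried else plan_to_frontier(model, state, env.ACTIONS)
        if action is None:                   # nothing left to explore
            break
        nxt, done = env.step(action)         # act and observe the outcome
        model[(state, action)] = nxt         # update the world model
        state, used = nxt, used + 1
    return done, used, model


solved, used, model = run_episode(ToyEnv(size=5))
print("solved:", solved, "actions used:", used, "transitions learned:", len(model))
```

The agent is never told what the actions do or where the goal is; it reaches the last cell only by recording outcomes and steering toward states it has not fully explored, which is the kind of pressure the interactive format applies.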

The 25 environments

The benchmark has 183 levels across 25 small game-like environments. Each has its own rules, which must be inferred from frames and action feedback, with no source-code access and no documentation. Examples include light-switch logic boards, gravity puzzles, block sorters, and snake-style collection games.

How ADAM plays — substrate-only configuration.

For this scorecard ADAM runs as a cognitive engine without an LLM and without a human. The runner calls HTTP endpoints — /game_search_init, /game_search_expand, /game_search_next, /game_procedure_learn — and ADAM's substrate decides scores, action priors, and frontier dedup. Implementation detail is in the technical appendix below; the headline is the loop shape, not the formula.
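A hedged sketch of what such a runner loop could look like follows. Only the four endpoint paths and the base URL come from this page. The JSON payload fields (`frame`, `action`, `trace`), the call order, the `env_step` callback, and the reading of `hard_cap` as a per-level action limit are all assumptions made for illustration, not the actual wire protocol.

```python
import json
from urllib import request

ADAM_URL = "http://127.0.0.1:8080"  # base URL taken from the reproduce command


def post_json(path, payload, base_url=ADAM_URL):
    """POST a JSON body and decode a JSON reply (assumed wire format)."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def play_level(env_step, first_frame, post=post_json, hard_cap=1000):
    """Drive one level. The substrate chooses actions; the runner only relays frames.

    env_step(action) -> (next_frame, solved) is a hypothetical environment hook.
    All payload keys below are assumptions, not documented fields.
    """
    post("/game_search_init", {"frame": first_frame})
    trace, frame = [], first_frame
    for _ in range(hard_cap):
        action = post("/game_search_next", {"frame": frame})["action"]
        frame, solved = env_step(action)
        post("/game_search_expand", {"action": action, "frame": frame})
        trace.append(action)
        if solved:
            # Hand the successful action sequence back so it can be reused later.
            post("/game_procedure_learn", {"trace": trace})
            return True, len(trace)
    return False, len(trace)
```

Passing `post` as a parameter keeps the loop testable without a running ADAM instance: a stub that returns a fixed `{"action": ...}` dictionary is enough to exercise the control flow.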

Reproduce locally

python3 arc_agi3_runner/adam_grid_agent.py \
    --hard-cap 1000 --adam-url http://127.0.0.1:8080 \
    --substrate-only
# → self-evaluation scorecard: 25 / 183, 2,673 actions

Technical appendix

ADAM's internal layers — geometric (Clifford-style) over concept embeddings, dual-torus dynamics for goal pursuit, biological-routing flow, and parallel hypothesis evaluation on GPU — are described on the ADAM page and in research reports. They are described there because they are how the loop is implemented, not as headline claims here.


Frequently asked.

Is this a Kaggle / ARC Prize leaderboard rank submission?

No. This page is a self-evaluation scorecard published by CyberdyneLabs. The score (6.77%, 25 of 183 levels, 2,673 actions, 2026-05-18) is a non-official internal measurement used as an engineering gate for ADAM's cognitive loop. It is not a submitted Kaggle entry and not a leaderboard rank claim.

What is ARC-AGI-3?

ARC-AGI-3 is the third generation of the Abstraction and Reasoning Corpus benchmark, operated by ARC Prize. It contains 25 interactive video-game-style environments totalling 183 levels, designed to test general fluid intelligence rather than memorisation. Solvers must learn each game's rules from scratch, with no internet access and no source-code access.

Why publish a 6.77% number?

Because honesty is cheaper than retraction. The number is small, the task is hard, and the useful signal is the closed loop — experience changes memory, memory changes procedure selection, procedure selection improves future runs. We want the public number on the record so internal progress can be measured against it.

How does ARC-AGI-3 differ from ARC-AGI-2?

ARC-AGI-2 was static — a single input grid mapping to a single output grid per puzzle. ARC-AGI-3 is interactive: the solver receives a stream of frames, takes actions, and the environment responds. This requires planning, exploration, memory of state transitions, and goal pursuit — closer to embodied agency than to one-shot puzzle solving.

What is ADAM exactly?

ADAM is a local C++ cognitive engine being engineered toward proto-AGI: durable memory, reasoning paths, world model, action, verification, self-correction, and self-curriculum. It is not a rented API and not a GPT wrapper. Architecture and reports live at /adam and in /r/.

Are the other numbers on this page official?

No. The reference-context table reproduces numbers from publicly disclosed reports / open-source repos. We do not control those sources. The table is provided as orientation; do not treat it as a ranking.

Will you update this when ADAM improves?

Yes. This page is dated; new self-evaluation runs replace the score with a new dated scorecard, and the previous number stays in the report archive at /r/.

Explore further.

This page is a single behavioural-gate report. The deeper system writeups live on these adjacent pages: