# ARC-AGI Benchmark Results
Exploring how different neural network training strategies perform on the Abstraction and Reasoning Corpus (ARC), a benchmark designed to measure genuine artificial general intelligence through novel reasoning tasks that require learning from just a few examples.
## What is ARC-AGI?
The Abstraction and Reasoning Corpus (ARC) is a benchmark created by François Chollet to measure machine intelligence in a way that goes beyond pattern recognition. Each task presents a few input-output examples, and the system must infer the underlying transformation rule and apply it to a new input, much like an IQ test for AI.
Unlike traditional ML benchmarks, ARC tasks require genuine abstraction: recognizing objects, understanding spatial relationships, counting, symmetry detection, and more, all from just 2-4 examples.
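The task structure above can be sketched as a minimal Python representation. The `Task` and `fits` names are illustrative, not from an official ARC library:

```python
from dataclasses import dataclass

Grid = list  # a grid is a list of rows; each row is a list of integer color codes

@dataclass
class Task:
    train: list       # list of (input_grid, output_grid) example pairs
    test_input: Grid  # the grid the inferred rule must be applied to

def fits(rule, task):
    """Return True if `rule` reproduces every training output exactly."""
    return all(rule(inp) == out for inp, out in task.train)

# Toy task: the hidden transformation mirrors the grid left-to-right.
task = Task(
    train=[
        ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
        ([[5, 5, 0]], [[0, 5, 5]]),
    ],
    test_input=[[7, 0, 0]],
)

mirror = lambda g: [row[::-1] for row in g]
if fits(mirror, task):
    prediction = mirror(task.test_input)  # [[0, 0, 7]]
```

Pixel-perfect accuracy, as used below, means the predicted grid equals the expected output cell for cell.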
## What We're Measuring
- Stability: how consistent the accuracy is over time (higher = more stable)
- Throughput: samples processed per second during real-time task switching
- Consistency: how reliably the model performs across different tasks
- Tasks Solved: the number of unique ARC tasks where the model achieved pixel-perfect accuracy
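These metrics could be computed from logged accuracy traces roughly as follows. The exact formulas and thresholds the benchmark uses are not specified here, so these definitions are illustrative stand-ins:

```python
import statistics

def stability(acc_trace):
    """Higher = more stable: 1 minus the standard deviation of the accuracy trace."""
    return 1.0 - statistics.pstdev(acc_trace)

def consistency(per_task_acc, threshold=0.5):
    """Fraction of tasks whose accuracy stays at or above a fixed threshold."""
    return sum(a >= threshold for a in per_task_acc) / len(per_task_acc)

def throughput(samples_processed, elapsed_seconds):
    """Samples processed per second of wall-clock time."""
    return samples_processed / elapsed_seconds

trace = [0.90, 0.92, 0.88, 0.91]
print(stability(trace))              # close to 1.0 for a flat trace
print(consistency([0.9, 0.2, 0.7]))  # 2 of 3 tasks clear the threshold
```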
## Mode Comparison
Comparing six training strategies under real-time task switching: cycling between 400 tasks every 100 ms while maintaining accuracy.
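A minimal sketch of such a switching harness, assuming a hypothetical `model.step(task)` interface; real runs would also record per-task accuracy at each step:

```python
import itertools
import time

def run_switching(model, tasks, switch_ms=100, total_s=1.0):
    """Rotate through `tasks` every `switch_ms` milliseconds while the model
    keeps training; return throughput as samples processed per second."""
    pool = itertools.cycle(tasks)
    task = next(pool)
    samples = 0
    start = last_switch = time.monotonic()
    while time.monotonic() - start < total_s:
        if (time.monotonic() - last_switch) * 1000 >= switch_ms:
            task = next(pool)            # abrupt switch to the next task
            last_switch = time.monotonic()
        model.step(task)                 # one training/eval step on this task
        samples += 1
    return samples / total_s
```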
| Mode | Stability | Throughput | Consistency | Solved | Score |
|---|---|---|---|---|---|
## Council of 1000
1000 randomized neural network architectures competing to find unique task solutions, testing for statistical saturation: does the discovery curve flatten, or does it keep rising?
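The saturation question can be framed as a cumulative-coverage computation. In this sketch each expert is abstracted as the set of task ids it solves (real experts would be trained networks), which is enough to trace the discovery curve:

```python
import random

def discovery_curve(experts):
    """experts: list of sets of solved task ids.
    Returns the cumulative count of unique tasks solved as experts are added."""
    solved, curve = set(), []
    for expert_tasks in experts:
        solved |= expert_tasks
        curve.append(len(solved))
    return curve

random.seed(0)
# Simulate 1000 experts, each solving a small random subset of 400 tasks.
experts = [set(random.sample(range(400), k=5)) for _ in range(1000)]
curve = discovery_curve(experts)
# If the tail of the curve is still rising, coverage has not yet saturated.
still_rising = curve[-1] > curve[len(curve) // 2]
```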
### Top 10 Experts
| Agent | Architecture | Solved |
|---|---|---|
## Evolutionary Zoo
2500+ mutant architectures with different topologies, brain types, activations, and learning rates, exploring the architectural fitness landscape.
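Sampling and mutating configurations over those axes might look like the following; the specific option lists are assumptions, not the benchmark's actual search space:

```python
import random

SEARCH_SPACE = {
    "topology":   [[16], [32, 16], [64, 32, 16]],  # hidden layer widths
    "brain_type": ["mlp", "rnn", "residual"],
    "activation": ["relu", "tanh", "gelu"],
    "lr":         [1e-1, 1e-2, 1e-3, 1e-4],
}

def random_config(rng):
    """Draw one option per axis to form a fresh configuration."""
    return {axis: rng.choice(options) for axis, options in SEARCH_SPACE.items()}

def mutate(config, rng):
    """Copy the config and re-roll exactly one randomly chosen axis."""
    child = dict(config)
    axis = rng.choice(list(SEARCH_SPACE))
    child[axis] = rng.choice(SEARCH_SPACE[axis])
    return child

rng = random.Random(42)
zoo = [random_config(rng) for _ in range(4)]
zoo += [mutate(parent, rng) for parent in zoo for _ in range(2)]  # 8 mutants
```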
### Hall of Fame: Top 10 Mutants
| # | Mutation Config | Solved |
|---|---|---|
## What Do These Results Mean?
Our experiments reveal several key insights about neural network training strategies for few-shot reasoning tasks:
- StepTweenChain excels at real-time adaptation: by training on every sample without batching, it maintains high accuracy even when rapidly switching between tasks.
- Collective intelligence matters: no single architecture solves all tasks. The Council of 1000 shows that combining diverse architectures covers more of the task space.
- Architecture diversity beats optimization: the Evolutionary Zoo demonstrates that exploring different topologies and brain types yields better coverage than hyperparameter tuning alone.
- Discovery curves are still rising: adding more architectures continues to uncover newly solvable tasks, suggesting we haven't hit the ceiling yet.
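The StepTweenChain internals are not shown here, but the first insight (per-sample updates enable fast adaptation) can be illustrated with a plain batch-size-1 SGD loop on a 1-D linear model:

```python
def online_step(w, x, y, lr=0.1):
    """One SGD step on a single (x, y) pair for the model y_hat = w * x."""
    grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
    return w - lr * grad

w = 0.0
# Task A is y = 2x; mid-stream the data switches abruptly to task B, y = -x.
stream = [(1, 2), (2, 4), (1, 2), (1, -1), (2, -2), (1, -1), (2, -2)]
for x, y in stream:
    w = online_step(w, x, y)  # weights change after every single sample
# After only a few post-switch samples, w is already tracking the new slope.
```

Because the weights move on every example rather than once per batch, recovery after a task switch begins immediately, which is the behavior the real-time switching results reward.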