A new, challenging AGI test stumps most AI models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems in which an AI has to identify visual patterns from a collection of grids of different-colored squares and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.
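To make the format concrete, here is a minimal Python sketch of how an ARC-style task is structured: a few “train” input/output grid pairs demonstrate a hidden transformation rule, and the solver must apply that rule to a “test” input. Grids are 2D arrays of small integers, each denoting a color. The toy task below (rule: mirror the grid horizontally) is invented for illustration and is not drawn from ARC-AGI-2 itself.

```python
# Illustrative toy task in the ARC style (hypothetical, not from the benchmark).
# Each integer 0-9 stands for a color; 0 is typically the background.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0], [0, 4, 0]], "output": [[0, 0, 3], [0, 4, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}

def solve(grid):
    """Hand-written solver for this one toy rule: mirror each row horizontally."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against the demonstration pairs...
for pair in toy_task["train"]:
    assert solve(pair["input"]) == pair["output"]

# ...then apply it to the held-out test input.
print(solve(toy_task["test"][0]["input"]))  # [[0, 5], [6, 0]]
```

The point of the format is that each task carries its own novel rule, so a solver must infer the rule from two or three examples rather than retrieve a memorized answer.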

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right, far better than any of the models’ scores.

A sample question from ARC-AGI-2 (credit: Arc Prize).

In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions. He previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”
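One way to picture what an efficiency metric changes is to report accuracy alongside cost per task instead of accuracy alone. The figures in this sketch are the ones cited in this article; the “accuracy per dollar” calculation is a hypothetical illustration, not the Arc Prize Foundation’s actual scoring methodology (its leaderboard reports score and cost per task separately).

```python
# Minimal sketch: fold compute cost into a model comparison.
# Numbers are from this article; the comparison logic is hypothetical.
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    accuracy: float       # fraction of ARC-AGI-2 tasks solved
    cost_per_task: float  # USD of compute spent per task

results = [
    Result("o3 (low)", 0.04, 200.00),
    Result("Arc Prize 2025 target", 0.85, 0.42),
]

for r in results:
    # A crude efficiency figure: accuracy earned per dollar of compute.
    print(f"{r.name}: {r.accuracy:.0%} at ${r.cost_per_task}/task "
          f"-> {r.accuracy / r.cost_per_task:.4f} accuracy per dollar")
```

Under any such cost-aware view, a model that buys its score with hundreds of dollars of compute per task ranks very differently from one that solves tasks cheaply.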

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI’s o3 model that was first to reach new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test but gets a measly 4% on ARC-AGI-2 while using $200 worth of computing power per task.

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
