Agents’ Last Exam

arXiv AI 12 Jun 2026

AI Insight

This paper introduces Agents' Last Exam (ALE), a benchmark for evaluating AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed with more than 250 industry experts, ALE covers 1,000+ tasks in non-physical industries, organized into 55 subfields across 13 industry clusters defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). Across mainstream harness and backbone configurations, the average full pass rate on the hardest tier is below 1%, pointing to a significant gap between benchmark success and readiness for real-world economic deployment.

Why it matters

ALE targets a specific limitation in AI evaluation: widely used benchmarks do not measure sustained performance on real professional workflows. By testing agents on such workflows, it could help identify which AI systems are ready for economically meaningful deployment and speed the translation of AI research into applications that affect productivity and economic output.

Confidence

6/10 | Peer-reviewed | AI & Computational Science

arXiv:2606.05405v2
Abstract: Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Source: Agents' Last Exam