The Jolting Technologies Hypothesis, confirmed in real time. I published my paper on AI’s superexponential growth just a few days ago. Now METR reports a median doubling time of ~4 months for AI task length across 9 benchmarks, down from its earlier estimate of 7 months on software tasks. This is EXACTLY what “jolting” means.
Quoting @AISafetyMemes
Incredible chart.
METR found a “Moore’s Law for AI agents”: the length of tasks AIs can do is doubling every 7 months.
They’re now seeing similar rates of improvement across domains!
And it’s SPEEDING UP, not slowing down.
Do you realize the implications of doubling every 4 MONTHS?
If you or someone you love suffers from Exponential Slope Blindness and has AGI timelines that make no sense, please show them this chart. Help is available.
METR’s Findings
@AISafetyMemes is quoting this post from METR:
METR previously estimated that the time horizon of AI agents on software tasks is doubling every 7 months.
We have now analyzed 9 other benchmarks for scientific reasoning, math, robotics, computer use, and self-driving; we observe generally similar rates of improvement.
We analyze data from 9 existing benchmarks: MATH, OSWorld, LiveCodeBench, Mock AIME, GPQA Diamond, Tesla FSD, Video-MME, RLBench, and SWE-Bench Verified, which either include human time data or allow us to estimate it.
The frontier time horizon on different benchmarks differs by >100x. Many reasoning and coding benchmarks cluster at or above 1 hour, but agentic computer use (OSWorld, WebArena) is only ~2 minutes, possibly due to poor tooling.
Since the release of the original report, new frontier models like o3 have been above trend on METR’s tasks, suggesting a doubling time faster than 7 months.
The median doubling time across 9 benchmarks is ~4 months (range is 2.5 - 17 months).
Time horizon isn’t relevant on all benchmarks. Hard LeetCode problems (LiveCodeBench) and math problems (AIME) are much harder for models than easy ones, but Video-MME questions on long videos aren’t much harder than on short ones.
Because we didn’t run these evals ourselves and had to estimate human time (and sometimes other key parameters), these are necessarily rougher estimates of time horizon than our work on HCAST, RE-Bench, and SWAA. GPQA Diamond is an exception, thanks to @EpochAIResearch data.
For more details, see METR’s blog post, “How does time horizon vary across domains?”
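To make the two doubling times quoted above concrete, here is a small arithmetic sketch. It assumes a constant doubling time within each scenario, which is a simplification; the function names are my own and are not from METR.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Growth multiple over 12 months, assuming a constant doubling time."""
    return 2 ** (12 / doubling_months)


def months_to_multiply(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor`, assuming a constant doubling time."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    # 7 and 4 months are the two doubling times quoted in METR's post.
    for doubling in (7, 4):
        per_year = annual_growth_factor(doubling)
        to_100x = months_to_multiply(100, doubling)
        print(f"doubling every {doubling} months: "
              f"{per_year:.2f}x per year, 100x in {to_100x:.1f} months")
```

Under that assumption, a 7-month doubling time works out to about 3.3x per year, while a 4-month doubling time works out to 8x per year.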
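I read this as real-time support for the Jolting Technologies Hypothesis: AI capabilities are not just growing exponentially, the doubling time itself is shrinking, which is what superexponential growth looks like. METR’s original estimate was a 7-month doubling time on software tasks; the median across the 9 benchmarks is now ~4 months, and new frontier models like o3 have come in above the original trend. That is jolting behavior in action, as predicted.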
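The difference between a fixed doubling time and a shrinking one can be sketched in a few lines. This is only an illustration: the 0.9 shrink factor below is a hypothetical value I picked for the example, not a number from METR or from the paper, and the function name is my own.

```python
def doublings_within(months: float, first_doubling: float,
                     shrink: float = 1.0, max_doublings: int = 1000) -> float:
    """Count doublings completed in `months`.

    Each doubling takes `shrink` times as long as the previous one.
    shrink == 1.0 is ordinary exponential growth; shrink < 1.0 is a
    hypothetical superexponential case where doubling time keeps falling.
    """
    if not 0 < shrink <= 1:
        raise ValueError("shrink must be in (0, 1]")
    elapsed, current, count = 0.0, first_doubling, 0
    while count < max_doublings and elapsed + current <= months:
        elapsed += current
        count += 1
        current *= shrink  # the next doubling is faster (or equal)
    if count == max_doublings:
        # The doubling times shrank so fast that the model blows up
        # inside the window; the sketch stops rather than loop forever.
        raise ValueError("doubling count exceeded max_doublings")
    # Credit the partially completed doubling linearly.
    return count + (months - elapsed) / current


if __name__ == "__main__":
    window = 28  # months, chosen so the constant case gives a round number
    constant = doublings_within(window, first_doubling=7)
    shrinking = doublings_within(window, first_doubling=7, shrink=0.9)
    print(f"constant 7-month doubling: {constant:.2f} doublings, "
          f"{2 ** constant:.0f}x")
    print(f"shrinking (hypothetical 0.9): {shrinking:.2f} doublings, "
          f"{2 ** shrinking:.0f}x")
```

With a constant 7-month doubling time, 28 months gives exactly 4 doublings. With the assumed 0.9 shrink factor, the same window gives roughly 4.86 doublings, and the gap widens the longer the window runs.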