A new study suggests that the length of tasks AI can complete is doubling roughly every seven months, pointing to a rapid expansion in its capabilities. Researchers developed a method that measures AI performance by how long a task takes a human, showing that AI excels at short tasks but struggles with longer ones. Experts predict generalist AI agents could begin to reshape workflows as early as 2026, changing both business operations and daily life management.
Researchers have found that artificial intelligence (AI) is doubling its ability to tackle complex tasks roughly every seven months. This rapid growth raises questions about how society will adapt to such fast-evolving capabilities. A new benchmark measures AI in a different way: it scores models by the length of the tasks they can complete, expressed as the time those tasks take humans. The study was posted to the preprint server arXiv on March 30 and has not yet been peer reviewed.
A team from the Model Evaluation & Threat Research (METR) group led the work. They found that AI excels at short tasks but struggles with longer ones that require sustained, intricate actions. They noted, “AI agents often seem to struggle with stringing together longer sequences of actions more than they lack skills or knowledge needed to solve single steps.” Knowing how long an AI can stay on task may therefore give a clearer picture of its true capabilities.
The study reveals an interesting trend: AI models show almost 100% success on tasks a human could finish in under four minutes. For tasks that take humans longer than four hours, that rate drops to a mere 10%. Not surprisingly, the latest models outperform older versions on these extended tasks. Over the past six years, the length of task that generalist AI models can complete with 50% reliability has doubled roughly every seven months.
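To make the doubling claim concrete, here is a minimal Python sketch of the arithmetic, assuming steady exponential growth with the seven-month doubling period reported above. The function names and the four-minute starting value are illustrative choices for this example, not part of the study.

```python
DOUBLING_MONTHS = 7.0  # doubling period reported in the article


def growth_factor(elapsed_months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """How many times longer tasks become after `elapsed_months`,
    assuming the task length doubles every `doubling_months`."""
    return 2.0 ** (elapsed_months / doubling_months)


def horizon_after(start_minutes: float, elapsed_months: float,
                  doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task length (in human minutes) reached after `elapsed_months`,
    starting from a hypothetical `start_minutes`."""
    return start_minutes * growth_factor(elapsed_months, doubling_months)


if __name__ == "__main__":
    six_years = 6 * 12  # months
    print(f"Doublings in six years: {six_years / DOUBLING_MONTHS:.1f}")
    print(f"Growth over six years: {growth_factor(six_years):,.0f}x")
    # Hypothetical starting point: a 4-minute task, 14 months (two doublings) later.
    print(f"4-minute horizon after 14 months: {horizon_after(4, 14):.0f} minutes")
```

Under these assumptions, six years holds a little over ten doublings, which compounds to a task length roughly 1,250 times longer than at the start.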
To assess this, the researchers put models through a wide range of tasks. They tested models such as Sonnet 3.7, GPT-4, and Claude 3 Opus on everything from simple queries to much harder jobs, like writing CUDA kernels or debugging PyTorch errors. Task suites like HCAST, which includes 189 tasks related to software automation, and RE-Bench were essential for this evaluation. The team also rated tasks by their “messiness” and used single-step actions to set a baseline for how quickly humans work.
The fascinating conclusion? AI’s so-called “attention span” is rapidly expanding. If this trend continues, the study predicts that by 2032, AIs could automate the equivalent of a month’s worth of human software development. Researchers believe this kind of measure could radically change how we interpret absolute AI performance compared to human capabilities.
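The kind of extrapolation behind such a forecast can be sketched in a few lines of Python. This is a rough illustration, not the study's method: it assumes the seven-month doubling simply continues, takes a work month to be about 167 hours (a 40-hour week), and uses hypothetical starting horizons, since the article does not state the current one.

```python
import math

WORK_MONTH_HOURS = 167.0  # assumption: roughly 40 hours/week for one month


def months_until(target_hours: float, start_hours: float,
                 doubling_months: float = 7.0) -> float:
    """Months of steady doubling needed for the task horizon to grow
    from `start_hours` to `target_hours`."""
    if start_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    if target_hours <= start_hours:
        return 0.0  # target already reached
    doublings = math.log2(target_hours / start_hours)
    return doubling_months * doublings


if __name__ == "__main__":
    # Hypothetical starting horizons, in hours of human work.
    for start in (0.25, 1.0, 4.0):
        months = months_until(WORK_MONTH_HOURS, start)
        print(f"start = {start:>4} h -> {months:5.1f} months "
              f"({months / 12:.1f} years)")
```

The result is sensitive to the inputs: each fourfold change in the assumed starting horizon shifts the answer by 14 months, so the sketch shows the shape of the calculation and is not an attempt to reproduce the study's date.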
Experts have weighed in on the work. Sohrob Kazerounian, an AI researcher at Vectra AI, described the new metric as a handy way to gauge AI intelligence and task capability. Measuring the time complex tasks take, according to Kazerounian, is a relevant and useful data point for assessing AI’s potential to tackle human-like problems. Eleanor Watson of Singularity University likewise sees the method as a valuable and intuitive way to understand how well AI can sustain focus on a task over time.
Looking ahead, the implications are significant. Experts like Watson predict we might see generalist AI agents capable of juggling many different kinds of tasks as early as 2026. These AI systems could make life easier for both businesses and consumers by handling everything from complex work to everyday chores like travel planning or financial oversight.
In the coming years, as AIs learn to handle more extensive, real-world tasks, we’re looking at a fundamental change in how society interacts with such technologies. Watson summed it up nicely: “Powerful generalist AI agents — capable of flexibly switching among diverse tasks — will emerge prominently.” And as these systems integrate a variety of skills into streamlined workflows, they just might redefine our daily lives and work styles.
In conclusion, the rapid advancement of AI, evidenced by the steady doubling of the task lengths it can handle, opens the door to a new understanding of artificial intelligence’s capabilities. This novel benchmark, built around task duration, provides a clearer view of AI’s performance compared to humans. As generalist AI emerges, these technologies may come to reshape everyday tasks across industries and personal lives alike.
Original Source: www.livescience.com