AI Is Getting Better at Long, Complex Tasks Faster Than We Think. But What Does That Mean for Real Work?

I recently came across a fascinating study from METR that tracks what they call the "time horizon" of large language models, and the trend line is honestly pretty eye-opening. For context, a model’s time horizon is the length of task, measured by how long a human expert would take to finish it, that the model can complete with a 50% success rate. It’s a clever way to quantify not just how good AI is at quick trivia or short writing tasks, but how well it can tackle extended, complex work that takes humans hours to finish.
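
To make the 50% idea concrete, here is a minimal sketch of how a time horizon could be read off a success curve. It assumes success probability follows a logistic curve in the log of task length, which is my understanding of the general approach rather than a confirmed detail of METR’s code. The function names and the coefficients are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def success_probability(task_minutes, intercept, slope):
    """Logistic curve in log2(task length): longer tasks get a lower success rate."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(intercept, slope, p=0.5):
    """Task length in minutes at which the curve crosses success probability p."""
    if slope >= 0:
        raise ValueError("slope must be negative: longer tasks should be harder")
    logit = math.log(p / (1.0 - p))  # equals 0 when p is 0.5
    return 2.0 ** ((logit - intercept) / slope)


# Hypothetical fitted coefficients, for illustration only (not real data).
INTERCEPT, SLOPE = 3.0, -0.5

horizon = time_horizon(INTERCEPT, SLOPE)  # 2 ** 6 = 64 minutes
print(f"50% time horizon: {horizon:.0f} minutes")
print(f"success rate at that length: {success_probability(horizon, INTERCEPT, SLOPE):.2f}")
```

With these made-up numbers the model succeeds half the time on tasks that take a human about an hour, and the same curve gives a much shorter horizon if you demand a higher success rate (pass `p=0.8`, for example).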

The study found that this time horizon has been doubling roughly every seven months, and the pace may have sped up even more recently. To put that in concrete terms: GPT-2 back in 2019 had a time horizon of only about 2 seconds, while frontier models reached nearly 2 hours last year, and newer internal test models are reportedly hitting 12 hours already. If the trend holds, we could see frontier AIs hit a 50% success rate on tasks that take human experts a full month to complete as soon as 2027. That’s not some far-off sci-fi future; it’s only a few years away.
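
The doubling arithmetic is easy to check yourself. The sketch below is a back-of-the-envelope calculation, not METR’s forecasting method: it assumes steady exponential growth at the quoted seven-month doubling time, and it assumes a "month" of expert work is about 167 working hours. The helper name is my own.

```python
import math

DOUBLING_MONTHS = 7.0     # doubling time quoted in the article
WORK_MONTH_HOURS = 167.0  # assumption: one month of full-time expert work


def months_to_reach(start_hours, target_hours, doubling_months=DOUBLING_MONTHS):
    """Months for the horizon to grow from start to target under steady doubling."""
    if start_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    doublings = math.log2(target_hours / start_hours)
    return doublings * doubling_months


# Sanity check on the historical figures: 2 seconds to 2 hours.
print(f"2 s -> 2 h: {months_to_reach(2 / 3600, 2) / 12:.1f} years")

# Forward extrapolation from the two starting points mentioned in the article.
for start in (2.0, 12.0):
    months = months_to_reach(start, WORK_MONTH_HOURS)
    print(f"{start:.0f} h -> one work-month: {months:.0f} months")
```

Under these assumptions, going from 2 seconds to 2 hours takes just under seven years, which lines up with the 2019 starting point. Going forward, a 2-hour horizon needs roughly 45 months to reach a work-month, while a 12-hour horizon needs roughly 27. The earliest dates people quote therefore lean on the higher starting point, a faster recent doubling time, or both.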

Of course, there are important caveats here that keep this from meaning doomsday or utopia overnight. The study’s authors are very upfront that their benchmark tasks are much neater than real-world work. Real software projects, for example, often lack clear success metrics, allow irreversible mistakes that cost real money, and run on tons of unwritten team conventions that no benchmark can capture. METR’s own past research found that half of AI-written pull requests that pass all automated tests still get rejected by human maintainers for missing context or breaking unspoken rules. That gap between clean lab benchmarks and messy real life is still really wide.

One of the most thought-provoking bits of the paper, though, is a throwaway question from one of the researchers’ colleagues: if AI time horizons keep growing, what happens to how we build software? Right now, we spend so much time on modular design, clean code, and maintainability because our human brains can only hold so much complexity at once. If an AI can hold an entire massive system in its head, maybe we don’t need all that structure anymore. We might be moving from an era of hand-crafted, built-to-last software to something more like disposable, quick-to-remake tools, cheap enough to throw away and rebuild from scratch whenever we need to. It makes you wonder: what other parts of our work and lives are structured around the limits of human attention, and how will those shift when AI doesn’t share those limits?


Source: https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/