3. Caveats and Future Work
Our project has some important limitations, both theoretical and empirical. The caveats below point to room for improvement in future iterations of the project, as well as to new lines of investigation.
3.1 Limitations of the theoretical approach
3.1.1 Temporal coherence as a concept
Our theoretical framework is based on the concept of temporal coherence, which we characterise as the ability to put together, consistently, different atomistic tasks in the pursuit of a more complex task. However, this is not the only possible framework for understanding AIs’ capabilities. For instance, others have used the t-AGI framework, which bears some resemblance to ours but is not identical. We also do not offer a formal definition of the concept. One way in which we plan to expand the project is by incorporating the concept more explicitly into the model proposed by Korinek and Suh.
One objection that could be raised against the concept of temporal coherence comes from the ‘jagged frontier’ of AI systems: the observation that AI capabilities are very irregular. While models can perform tasks that would take a human many hours in mere seconds or minutes, they struggle with seemingly easy, short-term tasks. We plan to incorporate this concern into our framework in the future.
3.1.2 Temporal coherence as a significant bottleneck
We make an additional claim: if AIs could act coherently over arbitrarily long periods of time, putting together the different atomistic tasks they are already capable of performing, this would unlock more economic value than ‘solving’ any other single ability.
However, it is plausible that other bottlenecks are equally or more important; candidates include memory, context, multimodality and cooperation. We aim to dedicate more time to identifying the most important bottlenecks to job automation in our future research, possibly with the help of an economic model like the O-Ring model.
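To give a flavour of why a model like this is attractive for bottleneck analysis, here is a minimal statement of Kremer’s (1993) O-Ring production function (the notation is Kremer’s; applying it to AI abilities is our speculation, not something the model was built for):

```latex
% Kremer (1993): expected output from a production process with n tasks,
% each completed successfully with probability q_i, capital k, and
% output-per-task B when every task succeeds
E[y] = k^{\alpha} \left( \prod_{i=1}^{n} q_i \right) n B
```

Because the success probabilities enter multiplicatively, overall output is dominated by the least reliable task. On this view, if temporal coherence is currently the lowest q_i for AI systems, improving it would raise the value of every other ability it is combined with, which is one way to make our bottleneck claim precise.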
3.1.3 Task length as a proxy for coherence
Our measure of temporal coherence is the time it would take a human to complete a given task. The idea behind the proxy is that the longer the task, the more it requires putting together different pieces, or atomistic tasks, to complete it. But this is not a perfect measure. The proxy could be misleading if a task takes a human a long time to complete not because it requires more coherence to put the subtasks together, but because the atomistic tasks themselves take a long time to complete. To some extent, this seems contingent on how narrowly we define atomistic tasks: if we define them narrowly enough, as Korinek and Suh do in their original paper, then task length seems like a good proxy, because, by definition, no single atomistic task would take long to complete. But there is room for disagreement, and the measure could be improved.
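One way to sharpen this point (an illustrative formalisation of our own, not something taken from Korinek and Suh) is to decompose a task’s total human completion time into the durations of its atomistic subtasks:

```latex
% T: total human completion time; t_i: duration of atomistic subtask i
T = \sum_{i=1}^{n} t_i, \qquad t_i \le \varepsilon \;\Rightarrow\; n \ge T / \varepsilon
```

If atomistic tasks are defined narrowly enough that every t_i is bounded by some small ε, then a long task necessarily contains many subtasks, so task length tracks the number of pieces that coherence must span. The proxy fails exactly when a single t_i accounts for most of T.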
Finally, the ‘jagged frontier’ concern also applies here: if AI systems are very irregular, it is not clear that the time it would take a human to complete a task tells us much about the temporal coherence required of AIs.
3.2 Limitations of the estimates of task length
3.2.1 LLM classification limitations
Humans have a hard time estimating how long tasks take, and so do LLMs. In our test bench, we evaluated the LLM’s self-consistency and found that the more ambiguous a task statement was, the more the model alternated between a few distinct estimates, though it stayed consistent within that small set of values.
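The check itself can be quite simple. Below is a minimal sketch of this kind of self-consistency measurement, assuming a hypothetical query_model callable that returns a single time estimate in minutes (the function name and summary statistics are illustrative, not our exact test bench):

```python
import statistics
from collections import Counter

def self_consistency(query_model, task_statement: str, n_samples: int = 10) -> dict:
    """Re-query the model on the same task statement and summarise dispersion.

    `query_model` is a hypothetical callable wrapping an LLM call that
    returns an estimated human completion time in minutes.
    """
    estimates = [query_model(task_statement) for _ in range(n_samples)]
    counts = Counter(estimates)
    return {
        "estimates": estimates,
        # how many distinct values the model alternates between
        "n_distinct": len(counts),
        # share of samples on the most common value (1.0 = fully consistent)
        "modal_share": counts.most_common(1)[0][1] / n_samples,
        # overall spread of the estimates, in minutes
        "spread_minutes": statistics.pstdev(estimates),
    }
```

A high n_distinct with a low modal_share on an ambiguous statement is the pattern described above: the model toggles between a few readings of the task rather than drifting randomly.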
We also manually estimated the time to completion for 45 task statements and compared our results to the model’s; its estimates were close enough to ours.
In the future, we might try to cross-validate the estimates by developing additional methods of time estimation; there is prior work in this area. We might also use a model equipped with estimation tools, or spend time developing a larger validation set so that we can evaluate our prompt and model choices more accurately.
3.2.2 METR’s estimates
We use METR’s estimates of the increasing capacity of AI agents to perform longer time-horizon tasks, with the task length doubling every 7 months, to derive our result that more than 60 per cent of tasks could be performed by AI agents by 2026.
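As a rough illustration of the extrapolation involved, here is a minimal sketch under placeholder assumptions: the base horizon, base date and task-length list below are invented for the example and are not METR’s figures or our dataset.

```python
from datetime import date

DOUBLING_MONTHS = 7           # METR's reported doubling time
BASE_HORIZON_MIN = 60         # assumed agent time horizon (minutes) at the base date
BASE_DATE = date(2025, 3, 1)  # assumed base date for that horizon

def horizon_at(target: date) -> float:
    """Extrapolated task-length horizon (minutes) at a future date."""
    months = (target.year - BASE_DATE.year) * 12 + (target.month - BASE_DATE.month)
    return BASE_HORIZON_MIN * 2 ** (months / DOUBLING_MONTHS)

def share_automatable(task_lengths_min, target: date) -> float:
    """Fraction of tasks whose estimated human completion time falls under the horizon."""
    h = horizon_at(target)
    return sum(t <= h for t in task_lengths_min) / len(task_lengths_min)

# Usage with hypothetical per-task time estimates (minutes):
lengths = [5, 20, 45, 90, 180, 480, 2400]
print(share_automatable(lengths, date(2026, 12, 1)))
```

Note that this toy version counts a task as automatable as soon as its estimated length falls under the horizon, ignoring METR’s success-rate variable, which is exactly the simplification we flag below.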
The problem with using those estimates is that they are based on a set of coding benchmarks which, as others have commented, do not reflect the messiness of real-world work environments. Moreover, the extent to which we can extrapolate the performance of AI agents on coding tasks to other types of tasks is an open question, especially given that reasoning models are particularly good at coding and maths relative to other fields. Finally, we do not take into account the success-rate variable in METR’s results, which is clearly important when it comes to automating real-world tasks.