sprint-econtai/Temporal Coherence Blog Post Caveats.md
Félix Dorn 43076bcbb1 old
2025-07-15 00:41:05 +02:00


3. Caveats and Future Work

Our project has some important limitations, both theoretical and empirical. The caveats below offer room for improvement in future iterations of the project, as well as new lines of investigation.

3.1 Limitations of the theoretical approach

3.1.1 Temporal coherence as a concept

Our theoretical framework is based on the concept of temporal coherence, which we characterise as the ability to put together, consistently, different atomistic tasks in the pursuit of a more complex task. However, this is not the only possible framework for understanding AI capabilities. For instance, some researchers have used the t-AGI framework, which bears some resemblance to ours but is not identical. Also, we don't offer a formal definition of the concept. One way in which we plan to expand the project further is by incorporating it more explicitly into the model proposed by Korinek and Suh.

One objection that could be raised against the concept of temporal coherence comes from the jagged frontier of AI systems. This is the observation that AI capabilities are very irregular: while AI systems can perform tasks that would take a human many hours in mere seconds or minutes, they struggle with seemingly easy, short-term tasks. We plan to incorporate this concern into our framework in the future.

3.1.2 Temporal coherence as a significant bottleneck

We make an additional claim: if AIs could act coherently over arbitrarily long periods of time, putting together the different atomistic tasks they are already capable of performing, more economic value would be unlocked than by solving any other single ability.

However, it is plausible that other bottlenecks are equally or more important, including memory, context, multimodality and cooperation. We aim to dedicate more time to identifying the most important bottlenecks to job automation in our future research, possibly with the help of an economic model like the O-Ring model.

3.1.3 Task length as a proxy for coherence

Our measure of temporal coherence is the time it would take a human to complete a given task. The idea behind the proxy is that the longer the task, the more it requires putting together different pieces, or atomistic tasks, to complete it. But this is not a perfect measure. The proxy could be misleading if a task takes a human a long time to complete not because it requires more coherence in putting together different subtasks, but because the atomistic tasks themselves take a long time to complete. To some extent, this seems contingent on how narrowly we define atomistic tasks: if they are defined narrowly enough, as Korinek and Suh do in their original paper, then task length seems like a good proxy, because by definition no single atomistic task would take long to complete. But there's room for disagreement, and the measure could be improved.

Finally, the jagged-frontier concern also applies here: if AI systems are very irregular, it's not clear that the time it would take a human to complete a task is very informative about the temporal coherence required of AIs.

3.2 Limitations of the estimates of task length

3.2.1 LLM classification limitations

Humans have a hard time estimating how long something takes, and so do LLMs. In our test bench, we evaluated the LLMs' self-consistency and found that the more ambiguous a task statement, the more the model alternated its estimate between a few values, though it stayed consistent among those values.

We also manually estimated the time to completion for 45 task statements and compared our results to the model's; its estimates were reasonably close to ours.
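As an illustration, the agreement between manual and model estimates can be quantified with a symmetric error metric. The numbers below are hypothetical placeholders, not our actual data; the point is the method, a mean absolute log-ratio, which penalises a 2x over-estimate and a 2x under-estimate equally.

```python
import math

# Hypothetical human vs. LLM time-to-completion estimates (minutes)
# for five task statements. These values are illustrative only.
manual = [5, 30, 120, 480, 15]
model = [6, 25, 150, 400, 20]

def mean_abs_log_ratio(human, llm):
    """Average |log2(llm / human)|; 1.0 means off by 2x on average."""
    return sum(abs(math.log2(m / h)) for h, m in zip(human, llm)) / len(human)

print(f"mean |log2 ratio|: {mean_abs_log_ratio(manual, model):.2f}")
# → mean |log2 ratio|: 0.31
```

A log-scale metric suits task durations, which span minutes to days, better than a raw absolute error dominated by the longest tasks.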

In the future, we might try to cross-validate the estimates by developing additional methods for time estimation; there is prior work in this area. We might also use a model harnessed with estimation tools, or spend time developing a larger validation set so that we can evaluate our prompt and model choices more accurately.

3.2.2 METR's estimates

We use METR's estimate that the length of tasks AI agents can complete, measured in human time, doubles every 7 months to derive the result that more than 60 per cent of tasks could be performed by AI agents by 2026.
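The extrapolation behind this kind of figure can be sketched as follows. The baseline horizon and time window are hypothetical placeholders; the real calculation depends on METR's measured baseline and on the distribution of task lengths across the economy, neither of which is reproduced here.

```python
DOUBLING_MONTHS = 7  # METR's reported doubling time for task horizons

def horizon(h0_minutes: float, months_elapsed: float) -> float:
    """Task-length horizon after `months_elapsed`, assuming it doubles
    every DOUBLING_MONTHS months from a baseline of `h0_minutes`."""
    return h0_minutes * 2 ** (months_elapsed / DOUBLING_MONTHS)

# A hypothetical 60-minute horizon today grows to roughly 10.8 hours
# of human task time after 24 months under this trend.
print(horizon(60, 24))
```

The share of tasks automatable by a given date then follows from comparing this horizon against the distribution of human completion times across tasks, which is where the classification estimates from the previous section come in.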

The problem with using those estimates is that they are based on a set of coding benchmarks which, as others have noted, do not reflect the messiness of real-world work environments. Moreover, the extent to which we can extrapolate the performance of AI agents on coding tasks to other types of tasks is an open question, especially given that reasoning models are particularly strong at coding and maths relative to other fields. Finally, we don't take into account the success-rate variable in METR's results, which, for obvious reasons, seems important when it comes to automating real-world tasks.