Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
- URL: http://arxiv.org/abs/2601.16486v1
- Date: Fri, 23 Jan 2026 06:28:52 GMT
- Title: Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
- Authors: Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Xiaozhe Li, Qipeng Guo, Dahua Lin, Kai Chen
- Abstract summary: We propose Timely Machine, redefining test-time as wall-clock time. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. We find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality.
- Score: 72.97800570813175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
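The core idea of the abstract, replacing a generation-length budget with a wall-clock budget that tool latency counts against, can be illustrated with a minimal sketch. All names below (`run_agent`, `Action`, `model_step`, the `search` tool) are hypothetical illustrations, not an API from the paper; the point is only that the stopping criterion is elapsed wall-clock time, and that the model is told its remaining budget so it can adapt its strategy.

```python
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # a tool name, or "answer" to finish
    payload: object

def run_agent(task, tools, model_step, budget_s):
    """Sketch of a wall-clock-budgeted agent loop: the budget is elapsed
    real time (which includes tool latency), not generated-token count."""
    start = time.monotonic()
    state, steps = task, 0
    while True:
        remaining = budget_s - (time.monotonic() - start)
        if remaining <= 0:
            # Budget exhausted: force a final answer with zero time left.
            return model_step(state, 0.0).payload, steps
        # The model sees the remaining budget and can adapt, e.g. skip
        # slow tools when little time is left.
        action = model_step(state, remaining)
        if action.kind == "answer":
            return action.payload, steps
        # Tool latency is spent inside the loop, so it counts against
        # the budget, unlike length-based test-time scaling.
        state = tools[action.kind](action.payload)
        steps += 1

def toy_model(state, remaining):
    # Toy policy: with ample time, call the slow search tool once;
    # otherwise answer immediately with whatever evidence we have.
    if remaining > 0.15 and state == "question":
        return Action("search", "query")
    return Action("answer", f"answer from {state}")

tools = {"search": lambda q: (time.sleep(0.05), "evidence")[1]}
result, steps = run_agent("question", tools, toy_model, budget_s=1.0)
# result == "answer from evidence", steps == 1
```

Under a tighter budget (e.g. `budget_s=0.1`), the same toy policy answers without calling the tool at all, which is the kind of budget-conditioned behavior the paper's Timely-RL trains for.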
Related papers
- TimeART: Towards Agentic Time Series Reasoning via Tool-Augmentation [27.695097439296948]
TimeART is a framework fusing the analytical capability of strong out-of-the-box tools and the reasoning capability of Large Language Models (LLMs). To teach the LLM-based Time Series Reasoning Models (TSRMs) strategic tool-use, we also collect a 100k expert trajectory corpus called TimeToolBench. To enhance TSRMs' generalization capability, we devise a four-stage training strategy, which boosts TSRMs through learning from their own early experiences and self-reflections.
arXiv Detail & Related papers (2026-01-20T06:39:10Z)
- Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models [57.49136894315871]
The new paradigm of test-time scaling has yielded remarkable breakthroughs in reasoning models and generative vision models. We propose one solution to the problem of integrating test-time scaling knowledge into a model during post-training. We replace reward-guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates initial input noise.
arXiv Detail & Related papers (2025-08-13T17:33:37Z)
- SPECS: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
SPECS is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z)
- Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. They struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z)
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
- A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting [0.07916635054977067]
Pruning is an established approach to reduce neural network parameter count and save compute. We study the effects of these pruning strategies on model predictive performance and computational aspects like model size, operations, and inference time. We demonstrate that even with corresponding hardware and software support, structured pruning is unable to provide significant time savings.
arXiv Detail & Related papers (2024-12-17T13:07:31Z)
- Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z)
- Gated Recurrent Neural Networks with Weighted Time-Delay Feedback [55.596897987498174]
We present a novel approach to modeling long-term dependencies in sequential data by introducing a gated recurrent unit (GRU) with a weighted time-delay feedback mechanism. Our proposed model, named τ-GRU, is a discretized version of a continuous-time formulation of a recurrent unit, where the dynamics are governed by delay differential equations (DDEs).
arXiv Detail & Related papers (2022-12-01T02:26:34Z)
- Towards Spatio-Temporal Aware Traffic Time Series Forecasting -- Full Version [37.09531298150374]
Traffic time series forecasting is challenging because temporal patterns vary across time: for example, some periods of a day exhibit stronger temporal correlations than others.
Existing spatio-temporal models employ a shared parameter space irrespective of time locations and time periods, implicitly assuming that temporal correlations are similar across locations and stable across time, which does not always hold.
We propose a framework that aims at turning spatio-temporal agnostic models into spatio-temporal aware models.
arXiv Detail & Related papers (2022-03-29T16:44:56Z)
- Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting [68.86835407617778]
Autoformer is a novel decomposition architecture with an Auto-Correlation mechanism.
In long-term forecasting, Autoformer yields state-of-the-art accuracy, with relative improvements across six benchmarks.
arXiv Detail & Related papers (2021-06-24T13:43:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.