Kinetics: Rethinking Test-Time Scaling Laws
- URL: http://arxiv.org/abs/2506.05333v3
- Date: Fri, 20 Jun 2025 01:25:25 GMT
- Title: Kinetics: Rethinking Test-Time Scaling Laws
- Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
- Abstract summary: The Kinetics Scaling Law suggests that test-time compute is more effective when spent on models above a size threshold than on smaller ones. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples.
- Score: 18.325591438335007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when spent on models above a size threshold than on smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, including evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important as more compute is invested, for realizing the full potential of test-time scaling: unlike in training, accuracy has yet to saturate as a function of computation and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
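As an illustration of the cost accounting the abstract describes, here is a minimal sketch of a per-request cost model that charges both parameter compute and attention KV-cache memory traffic. All constants and the `attn_sparsity` knob are hypothetical placeholders, not the paper's fitted coefficients.

```python
# Minimal sketch of a Kinetics-style cost model (illustrative only).
# It charges (1) compute proportional to parameter count and
# (2) memory traffic proportional to the KV cache read at each step.
# All constants are hypothetical, not the paper's actual coefficients.

def generation_cost(n_params: float, gen_len: int, kv_bytes_per_token: float,
                    n_samples: int = 1, attn_sparsity: float = 1.0) -> float:
    """Total cost (arbitrary units) for Best-of-N generation.

    attn_sparsity < 1.0 models sparse attention: each decoding step
    touches only a fraction of the KV cache.
    """
    compute = 2.0 * n_params * gen_len          # ~2 FLOPs per param per token
    # Each of the gen_len decoding steps reads the (growing) KV cache:
    kv_reads = kv_bytes_per_token * gen_len * (gen_len + 1) / 2 * attn_sparsity
    return n_samples * (compute + kv_reads)

# The attention term grows quadratically with generation length, so for
# long CoTs it dominates the parameter term -- the abstract's key point.
dense = generation_cost(1.7e9, 16384, kv_bytes_per_token=2.6e5)
sparse = generation_cost(1.7e9, 16384, kv_bytes_per_token=2.6e5,
                         attn_sparsity=0.1)
print(f"dense/sparse cost ratio: {dense / sparse:.2f}")
```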
Related papers
- $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$\texttt{SPECS}$ is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z)
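SPECS itself is not described in enough detail here to reproduce; the following is only a generic speculative-decoding acceptance loop, with `draft_model` and `target_model` as stand-in callables, to illustrate the family of methods it builds on.

```python
# Generic speculative decoding loop (a sketch of the idea SPECS builds on;
# not the SPECS algorithm itself). draft_model and target_model are
# hypothetical callables returning the greedy next token for a prefix.

def speculative_decode(prefix, draft_model, target_model, k=4, max_new=256):
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # Draft k tokens cheaply with the small model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify with the large model; accept the longest prefix where
        # the target agrees with the draft.
        n_accept = 0
        for i, t in enumerate(draft):
            if target_model(out + draft[:i]) == t:
                n_accept += 1
            else:
                break
        out += draft[:n_accept]
        out.append(target_model(out))  # target's own token keeps output exact
    return out
```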
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
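STAND's exact drafting tables are not specified in this summary; as a rough illustration of model-free N-gram drafting, here is a toy drafter that proposes continuations from n-gram statistics of the text generated so far (the class and its interface are hypothetical).

```python
from collections import Counter, defaultdict

# Toy model-free n-gram drafter in the spirit of STAND (illustrative
# sketch; STAND uses stochastic adaptive drafting, not this exact scheme).
class NGramDrafter:
    def __init__(self, n: int = 3):
        self.n = n
        self.table = defaultdict(Counter)  # (n-1)-gram -> next-token counts

    def update(self, tokens):
        """Fold freshly generated tokens into the statistics."""
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx][tokens[i + self.n - 1]] += 1

    def draft(self, tokens, k: int = 4):
        """Propose up to k tokens by repeatedly taking the most frequent
        continuation; stop as soon as the context is unseen."""
        out, ctx = [], list(tokens)
        for _ in range(k):
            counts = self.table.get(tuple(ctx[-(self.n - 1):]))
            if not counts:
                break
            tok = counts.most_common(1)[0][0]
            out.append(tok)
            ctx.append(tok)
        return out
```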
- Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models [7.2703757624760526]
We introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling: parallel scaling and sequential scaling. We validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA.
arXiv Detail & Related papers (2025-05-26T20:58:45Z)
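The summary does not give TTSPM's functional form; the usual starting point for reasoning about a parallel-scaling saturation point is the Best-of-$N$ success curve, sketched below under the simplifying assumption of i.i.d. samples with per-sample success probability $p$ (not the paper's exact model).

```latex
% Success probability of N i.i.d. parallel samples, each succeeding
% with probability p (a standard simplification, not TTSPM itself):
P_N = 1 - (1-p)^N, \qquad
P_{N+1} - P_N = p\,(1-p)^N \le \varepsilon
\;\Longrightarrow\;
N^\star = \left\lceil \frac{\ln(\varepsilon/p)}{\ln(1-p)} \right\rceil
% The marginal gain decays geometrically, so beyond N* the extra
% samples no longer pay for their cost threshold epsilon.
```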
- Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence [38.30075427255948]
Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs. This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models.
arXiv Detail & Related papers (2025-05-23T18:19:09Z)
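GG's reinforced intrinsic-confidence signal is not specified here; a minimal stand-in for verifier-free guidance is to rank sampled continuations by their own mean token log-probability, as sketched below (all function names are hypothetical).

```python
# Verifier-free candidate ranking (a stand-in for intrinsic-confidence
# guidance like GG's; GG additionally reinforces the confidence signal,
# which this sketch omits). `sample_with_logprobs` is a hypothetical
# function returning (text, [token logprobs]) for one sampled answer.

def best_by_intrinsic_confidence(prompt, sample_with_logprobs, n=8):
    candidates = [sample_with_logprobs(prompt) for _ in range(n)]

    # Mean token logprob = the model's own confidence in its answer;
    # no external process reward model (PRM) is consulted.
    def confidence(cand):
        _, logprobs = cand
        return sum(logprobs) / max(len(logprobs), 1)

    return max(candidates, key=confidence)[0]
```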
- Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement [22.801244105119025]
We propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation. We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs. Experiments on four reasoning benchmarks, MATH500, AMC, AIME24, and OlympiadBench, demonstrate that TS is highly effective compared to s1's budget-forcing approach.
arXiv Detail & Related papers (2025-05-12T18:04:39Z)
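The summary contrasts a stopping-point method (TS) with s1-style budget forcing; the baseline is simple enough to sketch, on the assumption that budget forcing truncates the chain of thought at a token budget and forces an answer delimiter. The `generate` function and delimiter strings below are hypothetical stand-ins.

```python
# s1-style budget forcing, the baseline the summary compares against
# (a sketch under stated assumptions; `generate` is a hypothetical
# token-streaming decode function yielding string tokens).
def budget_forced_answer(prompt, generate, budget=2048,
                         end_think="</think>", answer_cue="Final Answer:"):
    tokens = []
    for tok in generate(prompt):
        tokens.append(tok)
        if len(tokens) >= budget:
            break  # hard cap: stop thinking regardless of state
    # Force the model out of its reasoning phase and into an answer.
    return "".join(tokens) + end_think + "\n" + answer_cue
```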
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z)
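The exact weighting in Shortest Majority Vote is not given in this summary; one plausible reading, a majority vote that breaks ties toward answers backed by shorter CoTs, is sketched below.

```python
from collections import defaultdict

# A sketch of a length-aware majority vote in the spirit of
# "Shortest Majority Vote" (the paper's precise weighting may differ).
# Each sample is (answer, cot_length_in_tokens).

def shortest_majority_vote(samples):
    groups = defaultdict(list)
    for answer, cot_len in samples:
        groups[answer].append(cot_len)

    # Rank first by vote count, then prefer the answer whose supporting
    # chains of thought are shortest on average.
    def key(ans):
        lens = groups[ans]
        return (len(lens), -sum(lens) / len(lens))

    return max(groups, key=key)

print(shortest_majority_vote([("42", 900), ("42", 1100), ("41", 300)]))  # 42
```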
- Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization. Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
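As a concrete instance of the solvable setting the summary describes, here is a tiny random feature model trained by full-batch gradient descent on a reused training set; watching train and test loss separate over epochs reproduces the qualitative gap the theory analyzes. Dimensions and learning rate are arbitrary choices, not the paper's.

```python
import numpy as np

# Tiny random feature model trained with gradient descent (the solvable
# setting the summary describes). Repeatedly reusing the same training
# set lets the train-test gap build up over time. Sizes are arbitrary.
rng = np.random.default_rng(0)
d, n_features, n_train, n_test = 20, 200, 50, 500

w_star = rng.normal(size=d) / np.sqrt(d)           # ground-truth linear target
F = rng.normal(size=(n_features, d)) / np.sqrt(d)  # frozen random features

def make_data(n):
    X = rng.normal(size=(n, d))
    return np.maximum(X @ F.T, 0.0), X @ w_star    # ReLU features, clean labels

Phi_tr, y_tr = make_data(n_train)
Phi_te, y_te = make_data(n_test)

theta, lr = np.zeros(n_features), 0.05 / n_features
for epoch in range(2001):
    resid = Phi_tr @ theta - y_tr
    theta -= lr * Phi_tr.T @ resid / n_train       # full-batch GD step
    if epoch % 500 == 0:
        tr = np.mean(resid ** 2)
        te = np.mean((Phi_te @ theta - y_te) ** 2)
        print(f"epoch {epoch:5d}  train {tr:.4f}  test {te:.4f}")
```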
- Beyond neural scaling laws: beating power law scaling via data pruning [37.804100045519846]
We show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling. We develop a new simple, cheap, and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws.
arXiv Detail & Related papers (2022-06-29T09:20:47Z)
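The summary does not spell the metric out; the paper's self-supervised metric is reported to be k-means-based (score each example by the distance of its embedding to its nearest cluster centroid, pruning prototypical "easy" examples first when data is plentiful), which the sketch below imitates with scikit-learn. The `embeddings` input stands in for features from a self-supervised encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of a k-means-based self-supervised pruning metric, as reported
# for this paper: cluster embeddings, score each example by distance to
# its assigned centroid, and keep the hardest (most distant) examples.

def prune_by_prototypicality(embeddings: np.ndarray, keep_frac: float,
                             n_clusters: int = 10) -> np.ndarray:
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(embeddings)
    # Distance to the assigned centroid = difficulty score.
    dists = np.linalg.norm(
        embeddings - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(keep_frac * len(embeddings))
    return np.argsort(dists)[-n_keep:]   # indices of retained examples

emb = np.random.default_rng(0).normal(size=(1000, 32))
kept = prune_by_prototypicality(emb, keep_frac=0.5)
print(len(kept))  # 500 retained indices
```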
- Deep Equilibrium Optical Flow Estimation [80.80992684796566]
Recent state-of-the-art (SOTA) optical flow models use finite-step recurrent update operations to emulate traditional algorithms. These RNNs impose large computation and memory overheads, and are not directly trained to model such stable estimation. We propose deep equilibrium (DEQ) flow estimators, an approach that directly solves for the flow as the infinite-level fixed point of an implicit layer.
arXiv Detail & Related papers (2022-04-18T17:53:44Z)
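As a minimal illustration of the implicit-layer formulation (not the paper's actual flow network), the sketch below solves z* = f(z*, x) by damped fixed-point iteration; a DEQ flow estimator plugs a learned update cell in for `f` and differentiates through the equilibrium.

```python
import numpy as np

# Minimal fixed-point solve for an implicit layer z* = f(z*, x)
# (illustrative; DEQ flow estimators use a learned update cell for f
# and implicit differentiation for training).

def solve_equilibrium(f, x, z0, tol=1e-6, max_iter=500, damping=0.5):
    z = z0
    for i in range(max_iter):
        z_next = (1 - damping) * z + damping * f(z, x)  # damped iteration
        if np.linalg.norm(z_next - z) < tol:
            return z_next, i
        z = z_next
    return z, max_iter

# Toy contractive f so the iteration provably converges.
W = np.array([[0.3, 0.1], [0.0, 0.2]])
f = lambda z, x: np.tanh(W @ z + x)
z_star, iters = solve_equilibrium(f, x=np.array([0.5, -0.2]), z0=np.zeros(2))
print(z_star, iters)
```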