A1: Asynchronous Test-Time Scaling via Conformal Prediction
- URL: http://arxiv.org/abs/2509.15148v1
- Date: Thu, 18 Sep 2025 16:55:09 GMT
- Title: A1: Asynchronous Test-Time Scaling via Conformal Prediction
- Authors: Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong,
- Abstract summary: Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput.
- Score: 112.54016379556073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and incurring no accuracy loss compared to using target-model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
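The abstract's key statistical ingredient is an online calibration strategy that keeps the rejection rate of draft tokens under control. The paper's exact procedure is not reproduced here; the sketch below shows the generic split-conformal quantile calibration that such a rejection-rate guarantee typically builds on. All function names and the uniform score model are illustrative assumptions, not taken from the released code.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: given n held-out nonconformity scores,
    the ceil((n+1)(1-alpha))/n empirical quantile is a threshold that a
    fresh exchangeable score exceeds with probability at most alpha."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

def accept(draft_score, threshold):
    # Accept a draft token whenever its nonconformity score stays below
    # the calibrated threshold; otherwise defer to the target model.
    return draft_score <= threshold
```

With scores calibrated this way, the fraction of drafts sent back to the target model concentrates near the chosen alpha, which is the sense in which the rejection rate is "controlled" rather than tuned by hand.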
Related papers
- Tail-Aware Post-Training Quantization for 3D Geometry Models [58.79500829118265]
Post-Training Quantization (PTQ) enables efficient inference without retraining. PTQ fails to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. We propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline for 3D geometric learning.
arXiv Detail & Related papers (2026-02-02T07:21:15Z) - Exploring Test-time Scaling via Prediction Merging on Large-Scale Recommendation [13.057539100440634]
How to efficiently utilize and scale up computational resources during test time remains underexplored. The key point in applying test-time scaling to DLRS lies in effectively generating diverse yet meaningful outputs. Test-time scaling can be seamlessly accelerated with the increase in parallel servers when deployed online.
arXiv Detail & Related papers (2025-12-08T15:41:10Z) - ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models [99.6720868215076]
We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
arXiv Detail & Related papers (2025-11-24T18:55:59Z) - AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding [35.10915929939651]
Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components.
arXiv Detail & Related papers (2025-10-08T19:36:11Z) - Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models [51.48680261034029]
Diffusion large language models (dLLMs) generate text through iterative denoising. Current decoding strategies discard rich intermediate predictions in favor of the final output. We introduce two complementary methods that exploit temporal consistency.
arXiv Detail & Related papers (2025-08-12T17:59:57Z) - $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
$\texttt{SPECS}$ is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that $\texttt{SPECS}$ matches or surpasses beam search accuracy while reducing latency by up to $\sim$19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation [20.117825519637357]
We introduce Multiverse, a new generative model that enables natively parallel generation. Next, we build a real-world Multiverse reasoning model with co-design curation of data, algorithm, and system. For data creation, we develop Multiverse Curator, an automated LLM-assisted pipeline. We also implement Multiverse Engine to support parallel inference.
arXiv Detail & Related papers (2025-06-11T17:59:23Z) - Adaptive Inference-Time Scaling via Cyclic Diffusion Search [61.42700671176343]
We introduce the challenge of adaptive inference-time scaling: dynamically adjusting computational effort during inference. We propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination.
arXiv Detail & Related papers (2025-05-20T07:31:38Z) - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z) - Asynchronous Distributed Optimization with Delay-free Parameters [9.062164411594175]
This paper develops asynchronous versions of two distributed algorithms, Prox-DGD and DGD-ATC, for solving consensus optimization problems over undirected networks.
In contrast to alternatives, our algorithms can converge to the fixed point set of their synchronous counterparts using step-sizes that are independent of the delays.
arXiv Detail & Related papers (2023-12-11T16:33:38Z) - Robust Fully-Asynchronous Methods for Distributed Training over General Architecture [11.480605289411807]
Perfect synchronization in distributed machine learning problems is inefficient and even impossible due to the existence of latency, packet losses, and stragglers.
We propose a Robust Fully-Asynchronous Gradient Tracking method (R-FAST), where each device performs local computation and communication at its own pace, without any form of synchronization.
arXiv Detail & Related papers (2023-07-21T14:36:40Z) - AuxAdapt: Stable and Efficient Test-Time Adaptation for Temporally Consistent Video Semantic Segmentation [81.87943324048756]
In video segmentation, generating temporally consistent results across frames is as important as achieving frame-wise accuracy.
Existing methods rely on optical flow regularization or fine-tuning with test data to attain temporal consistency.
This paper presents an efficient, intuitive, and unsupervised online adaptation method, AuxAdapt, for improving the temporal consistency of most neural network models.
arXiv Detail & Related papers (2021-10-24T07:07:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.