ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
- URL: http://arxiv.org/abs/2512.07843v1
- Date: Mon, 24 Nov 2025 18:55:59 GMT
- Title: ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
- Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
- Abstract summary: We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
- Score: 99.6720868215076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
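The relationship between parallel threads and token latency can be illustrated with a small sketch. It is not taken from the paper: it assumes a simple fork-join trajectory and assumes that token latency is the number of tokens on the longest sequential chain (the critical path). The names Sequential, Parallel, critical_path_tokens, and serialized_speedup are hypothetical, and the token counts are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Sequential:
    """A stretch of ordinary left-to-right decoding (hypothetical type)."""
    tokens: int


@dataclass
class Parallel:
    """A fork-join block: threads decode concurrently, then rejoin."""
    thread_tokens: List[int]


Segment = Union[Sequential, Parallel]


def total_tokens(trajectory: List[Segment]) -> int:
    """Tokens generated overall, as if every thread were decoded one after another."""
    total = 0
    for seg in trajectory:
        if isinstance(seg, Sequential):
            total += seg.tokens
        else:
            total += sum(seg.thread_tokens)
    return total


def critical_path_tokens(trajectory: List[Segment]) -> int:
    """Tokens on the longest sequential chain (assumed proxy for token latency).

    A parallel block costs only as much as its longest thread, because the
    other threads are assumed to decode at the same time.
    """
    path = 0
    for seg in trajectory:
        if isinstance(seg, Sequential):
            path += seg.tokens
        else:
            path += max(seg.thread_tokens, default=0)
    return path


def serialized_speedup(trajectory: List[Segment]) -> float:
    """Ratio of serialized token count to critical-path token count."""
    path = critical_path_tokens(trajectory)
    if path == 0:
        raise ValueError("trajectory contains no tokens")
    return total_tokens(trajectory) / path


# Made-up example: a shared prefix, three concurrent threads, then a conclusion.
example = [Sequential(100), Parallel([300, 200, 250]), Sequential(50)]
print(total_tokens(example))          # 900
print(critical_path_tokens(example))  # 450
print(serialized_speedup(example))    # 2.0
```

In this toy case the serialized trajectory has 900 tokens but the critical path has only 450, so the ratio is 2.0. The figure reported in the abstract (up to 1.53x on average) is measured against sequential reasoning baselines, which this sketch does not model.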
Related papers
- Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing [76.48164395646019]
Parallel-Probe is a training-free controller designed to optimize online parallel thinking. It reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
arXiv Detail & Related papers (2026-02-03T18:59:41Z)
- Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning [68.9332598692234]
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations.
arXiv Detail & Related papers (2025-12-08T11:39:43Z)
- DeepPrune: Parallel Scaling without Inter-trace Redundancy [53.62015294143274]
Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient.
arXiv Detail & Related papers (2025-10-09T17:24:54Z)
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z)
- ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs [34.477777651648914]
Large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm. We propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
arXiv Detail & Related papers (2025-08-12T12:35:55Z)
- Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework [12.361554676966552]
Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence. We aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms.
arXiv Detail & Related papers (2025-07-09T13:28:35Z)
- Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.