Related papers: DeepPrune: Parallel Scaling without Inter-trace Redundancy

URL: http://arxiv.org/abs/2510.08483v1
Date: Thu, 09 Oct 2025 17:24:54 GMT
Title: DeepPrune: Parallel Scaling without Inter-trace Redundancy
Authors: Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li
Abstract summary: Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient.
Score: 53.62015294143274
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling techniques, that predicts answer equivalence from partial reasoning traces and achieves 0.87 AUROC on equivalence prediction. It is combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces token usage by over 80% compared to conventional consensus sampling in most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: https://deepprune.github.io/
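The abstract names an online greedy clustering step driven by a judge that predicts answer equivalence from partial traces. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the 0.5 threshold, the one-trace-per-cluster policy, and the toy judge (which just compares the last word of each prefix) are all assumptions made for illustration. In the paper the judge is a trained model.

```python
from typing import Callable, List, Sequence

# Hypothetical judge signature: probability that two partial traces
# will end in the same final answer.
Judge = Callable[[str, str], float]


def greedy_cluster(prefixes: Sequence[str], judge: Judge,
                   threshold: float = 0.5) -> List[List[int]]:
    """Assign each trace to the first cluster whose representative the
    judge considers equivalent; otherwise open a new cluster."""
    clusters: List[List[int]] = []
    for i, prefix in enumerate(prefixes):
        for cluster in clusters:
            representative = prefixes[cluster[0]]
            if judge(representative, prefix) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster matched: this trace may add answer diversity.
            clusters.append([i])
    return clusters


def traces_to_keep(clusters: List[List[int]], per_cluster: int = 1) -> List[int]:
    """Keep the first `per_cluster` traces of each cluster; the rest are pruned."""
    return [i for cluster in clusters for i in cluster[:per_cluster]]


def toy_judge(a: str, b: str) -> float:
    # Stand-in for a learned judge (assumption): compare the last word only.
    return 1.0 if a.split()[-1] == b.split()[-1] else 0.0


prefixes = [
    "x = 4 so answer 8",
    "double 4 gives 8",
    "x = 5 so answer 10",
    "4 + 4 = 8",
]
clusters = greedy_cluster(prefixes, toy_judge)
kept = traces_to_keep(clusters)
print(clusters, kept)
```

With the toy judge, three of the four traces fall into one cluster and only one of them continues decoding, which is the kind of saving the abstract attributes to pruning redundant paths. Comparing each new trace only against cluster representatives keeps the number of judge calls proportional to the number of clusters rather than the number of traces.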

Related papers

Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing [76.48164395646019]
Parallel-Probe is a training-free controller designed to optimize online parallel thinking. It reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
arXiv Detail & Related papers (2026-02-03T18:59:41Z)
Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking [14.561556728044918]
In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. Some samples can achieve comparable performance with a smaller N' < N, causing budget redundancy. We formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism.
arXiv Detail & Related papers (2026-01-29T12:22:45Z)
ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models [99.6720868215076]
We introduce ThreadWeaver, a framework for adaptive parallel reasoning. ThreadWeaver achieves accuracy on par with popular sequential reasoning models of comparable size. We show that ThreadWeaver delivers up to 1.53x average speedup in token latency.
arXiv Detail & Related papers (2025-11-24T18:55:59Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z)
Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability [4.054484966653432]
A key measure of machine learning (ML) classification models' safety and reliability is their ability to resist small, targeted input perturbations. We show that floating-point non-associativity coupled with asynchronous parallel programming on GPU is sufficient to result in misclassification. We also show that standard adversarial robustness results may be overestimated up to 4.6 when not considering machine-level details.
arXiv Detail & Related papers (2025-03-21T14:19:45Z)
Dynamic Parallel Tree Search for Efficient LLM Reasoning [102.16694475391665]
Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. We propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. Experiments on Qwen-2.5 and Llama-3 with Math500 and GSM8K datasets show that DPTS significantly improves efficiency by 2-4x on average.
arXiv Detail & Related papers (2025-02-22T14:13:37Z)
A Partial Regularization Method for Network Compression [0.0]
We propose an approach of partial regularization rather than the original form of penalizing all parameters, which is said to be full regularization, to conduct model compression at a higher speed. Experimental results show that as we expected, the computational complexity is reduced by observing less running time in almost all situations. Surprisingly, it helps to improve some important metrics such as regression fitting results and classification accuracy in both training and test phases on multiple datasets.
arXiv Detail & Related papers (2020-09-03T00:38:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.