Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
- URL: http://arxiv.org/abs/2509.26626v1
- Date: Tue, 30 Sep 2025 17:58:03 GMT
- Title: Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
- Authors: Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain,
- Abstract summary: Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement.<n>We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods.
- Score: 85.76129014170778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at https://github.com/HyperPotatoNeo/RSA.
Related papers
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates.<n>We find that step-level recombination is most beneficial on harder problems.<n>Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z) - Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [60.151643048803145]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.<n>Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.<n> Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.<n>As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.<n>By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [90.5036809670993]
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models.<n>Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task.<n>We evaluate GenRM against Self-Consistency (SC) for most practical inference budgets across diverse models and datasets.
arXiv Detail & Related papers (2025-04-01T17:41:57Z) - Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems [21.01887711305712]
We introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time in language and multimodal systems.<n>RINS significantly outperforms +55 other variants, including the recent "repeat-all-over" (RAO) strategy in Mobile LLM.<n>With light-weight adapters, RINS offers a no-regret strategy, meaning that RINS-enabled pretraining improves performance in language modeling.
arXiv Detail & Related papers (2025-02-11T12:11:40Z) - Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity.<n>Our ablation studies reveal that previously overlooked strategies can significantly enhance performance.<n>We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z) - Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [46.959380978972206]
We study inference scaling laws (aka test-time scaling laws) and compute-optimal inference.<n>As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies.<n>Our findings suggest that scaling inference compute with inference strategies can be more computationally efficient than scaling model parameters.
arXiv Detail & Related papers (2024-08-01T17:16:04Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Berrut Approximated Coded Computing: Straggler Resistance Beyond
Polynomial Computing [34.69732430310801]
We propose Berrut Approximated Coded Computing (BACC) as an alternative approach to deal with stragglers effect.
BACC is proven to be numerically stable with low computational complexity.
In particular, BACC is used to train a deep neural network on a cluster of servers.
arXiv Detail & Related papers (2020-09-17T14:23:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.