Generalized Parallel Scaling with Interdependent Generations
- URL: http://arxiv.org/abs/2510.01143v1
- Date: Wed, 01 Oct 2025 17:33:35 GMT
- Title: Generalized Parallel Scaling with Interdependent Generations
- Authors: Harry Dong, David Brandfonbrener, Eryk Helenowski, Yun He, Mrinal Kumar, Han Fang, Yuejie Chi, Karthik Abinav Sankararaman
- Abstract summary: We propose Bridge to generate interdependent responses in parallel. With only a small amount of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning.
- Score: 58.43994876504917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently from each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 50% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.
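The abstract's core idea, treating the batch of N parallel hidden states as one holistic tensor rather than N independent slices, can be illustrated with a minimal sketch. This is not the paper's architecture: the mean-pooled summary, the single projection `W_mix`, and the residual weight `alpha` are hypothetical stand-ins for whatever small cross-sequence layer Bridge actually adds.

```python
import numpy as np

def bridge_mix(hidden, W_mix, alpha=0.1):
    """Hypothetical cross-response mixing step.

    hidden: (N, d) hidden states for N parallel responses at one
    decoding step. Independent sampling would use each row alone;
    a Bridge-style layer instead reads the whole (N, d) tensor,
    letting every response see a small summary of the others.
    """
    summary = hidden.mean(axis=0, keepdims=True)  # (1, d) pooled view of all N responses
    update = summary @ W_mix                      # small learned projection: the only new parameters
    return hidden + alpha * update                # residual: each response keeps its own state

rng = np.random.default_rng(0)
N, d = 4, 8
h = rng.standard_normal((N, d))
W = rng.standard_normal((d, d)) * 0.01
out = bridge_mix(h, W)
assert out.shape == (N, d)
```

Because the new parameters live only in `W_mix`, the same layer applies unchanged for any generation width N, consistent with the abstract's claim that Bridge is trained once and scales to any width.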
Related papers
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. We find that step-level recombination is most beneficial on harder problems. Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z) - Rethinking Thinking Tokens: LLMs as Improvement Operators [80.12087211785949]
Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier? We identify an interesting inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace.
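The three-step PDR loop described in the abstract can be sketched as follows. The prompt templates, `model` callable, and `toy_model` stand-in are all hypothetical; only the draft/distill/refine structure comes from the abstract.

```python
def pdr_round(prompt, model, n_drafts=4, workspace_chars=256):
    """One Parallel-Distill-Refine round (sketch; `model` is a
    hypothetical callable, not an API from the paper)."""
    # (i) generate diverse drafts in parallel
    drafts = [model(f"Solve: {prompt}", seed=i) for i in range(n_drafts)]
    # (ii) distill them into a bounded textual workspace
    workspace = model("Summarize key ideas:\n" + "\n".join(drafts))[:workspace_chars]
    # (iii) refine conditioned on the compact workspace, not the full drafts
    return model(f"Using these notes:\n{workspace}\nSolve: {prompt}")

# toy deterministic stand-in so the sketch runs end to end
def toy_model(text, seed=0):
    return f"[answer{seed}:{len(text)} chars]"

result = pdr_round("2+2", toy_model)
```

The point of step (ii) is the Pareto trade-off the abstract raises: the refine step conditions on a bounded workspace, so context length stays fixed no matter how many drafts were sampled.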
arXiv Detail & Related papers (2025-10-01T17:08:59Z) - Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models [85.76129014170778]
Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods.
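The evolutionary flavor mentioned in the abstract suggests a population-style loop; the sketch below is one plausible reading, not the paper's algorithm. The subset size, prompt format, and `model` callable are assumptions.

```python
import random

def rsa(prompt, model, width=8, subset=3, rounds=2, seed=0):
    """Recursive Self-Aggregation sketch: keep a population of
    candidate solutions and repeatedly replace it with aggregations
    of random subsets. `model` is a hypothetical callable."""
    rng = random.Random(seed)
    pop = [model(prompt, i) for i in range(width)]  # initial independent candidates
    for _ in range(rounds):
        new_pop = []
        for i in range(width):
            parents = rng.sample(pop, subset)       # pick a subset to recombine
            new_pop.append(model(prompt + " | " + " ; ".join(parents), i))
        pop = new_pop                               # population size stays fixed
    return pop

pop = rsa("2+2", lambda p, i: f"cand{i}({len(p)})")
```

Unlike plain best-of-N, each round lets information flow between candidates before any final answer is selected.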
arXiv Detail & Related papers (2025-09-30T17:58:03Z) - Representation Consistency for Accurate and Coherent LLM Answer Aggregation [20.494987341489573]
Representation consistency (RC) is a test-time scaling method for aggregating answers drawn from multiple candidate responses of a large language model. RC enhances answer aggregation by considering the number of occurrences of each answer in the candidate response set. Our method only uses cached activations and lightweight similarity computations and requires no additional model queries.
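A minimal sketch of this style of aggregation follows: score each distinct answer by its occurrence count and by how tightly its supporting responses cluster in cached activation space. The exact scoring rule here (count times mean pairwise cosine similarity) is an assumption, not the paper's formula; the point is that no extra model queries are needed.

```python
import numpy as np

def rc_aggregate(answers, activations):
    """Representation-consistency-style answer aggregation (sketch).

    answers:     list of N extracted answers, one per candidate response
    activations: (N, d) cached hidden activations, one row per response
    """
    acts = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    scores = {}
    for ans in set(answers):
        idx = [i for i, a in enumerate(answers) if a == ans]
        group = acts[idx]
        coherence = float((group @ group.T).mean())  # mean pairwise cosine similarity
        scores[ans] = len(idx) * coherence           # count x representational coherence
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
base = rng.standard_normal(16)
acts = np.stack([base + 0.01 * rng.standard_normal(16) for _ in range(3)]
                + [rng.standard_normal(16) for _ in range(2)])
winner = rc_aggregate(["42", "42", "42", "7", "7"], acts)
```

Here the three "42" responses share nearly identical activations, so their coherence is close to 1 and "42" beats the two incoherent "7" responses.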
arXiv Detail & Related papers (2025-06-18T05:07:47Z) - Learning to Reason Across Parallel Samples for LLM Reasoning [45.60752271688715]
Scaling test-time compute brings substantial performance gains for large language models. We propose a new way to leverage such multiple-sample sets. We train a compact LLM that takes a sequence of multiple samples and outputs the final answer.
arXiv Detail & Related papers (2025-06-10T17:42:35Z) - Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers [57.95157497749428]
We propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier. RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× efficient test-time compute scaling.
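At test time, a jointly trained verifier enables verifier-guided selection over parallel samples, as sketched below. The `reason` and `verify` callables are hypothetical stand-ins for the two roles of the single model; the toy candidates and verifier exist only to make the sketch runnable.

```python
def verifier_select(prompt, reason, verify, n=8):
    """Verifier-guided best-of-N (sketch): sample n candidate
    solutions, score each with the generative verifier, and return
    the highest-scoring one."""
    candidates = [reason(prompt, i) for i in range(n)]
    scores = [verify(prompt, c) for c in candidates]  # verifier's estimate of correctness
    best = max(range(n), key=scores.__getitem__)
    return candidates[best]

ans = verifier_select(
    "2+2",
    lambda p, i: str(2 + 2 + (i % 3) - 1),  # noisy toy candidates: "3", "4", "5"
    lambda p, c: 1.0 if c == "4" else 0.0,  # toy verifier
)
```

Because scoring a candidate is cheap relative to generating it, widening n buys accuracy at modest extra cost, which is the test-time scaling the abstract refers to.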
arXiv Detail & Related papers (2025-05-07T22:41:26Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Scalable Exploration via Ensemble++ [26.53967194965416]
We propose a scalable exploration framework using a novel shared-factor ensemble architecture with random linear combinations. For linear bandits, we provide theoretical guarantees showing that Ensemble++ achieves regret comparable to exact Thompson Sampling. We extend this theoretical foundation to nonlinear rewards by replacing fixed features with learnable neural representations.
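The "random linear combinations" idea can be sketched at the action-selection step: instead of sampling a parameter vector from an exact posterior, draw a random combination of M shared ensemble heads and act greedily on it. The training of the heads (typically on perturbed targets) is assumed to happen elsewhere; this is not the Ensemble++ implementation.

```python
import numpy as np

def ensemble_act(actions, Theta, rng):
    """Ensemble-sampling action selection (sketch).

    actions: (K, d) feature matrix, one row per arm
    Theta:   (d, M) matrix of M ensemble head parameters
    """
    M = Theta.shape[1]
    zeta = rng.standard_normal(M) / np.sqrt(M)  # random index into the ensemble
    theta = Theta @ zeta                        # one sampled parameter vector
    scores = actions @ theta                    # linear reward model
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
A = np.eye(3)                          # 3 one-hot arms in d = 3
Theta = rng.standard_normal((3, 8))    # M = 8 shared heads (assumed pre-trained)
choice = ensemble_act(A, Theta, rng)
```

The appeal is cost: sampling `zeta` and one matrix-vector product replaces maintaining and sampling an exact posterior, while (per the abstract) matching Thompson Sampling's regret in the linear case.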
arXiv Detail & Related papers (2024-07-18T06:16:09Z) - Scaling Efficient LLMs [0.0]
The "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. We propose recurrent transformers, combining the efficacy of transformers with the efficiency of recurrent networks.
arXiv Detail & Related papers (2024-02-22T18:06:19Z) - DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
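The core data movement (partition along the sequence dimension, then an all-to-all so each rank holds the full sequence for a slice of attention heads) can be simulated in a single process. This is a sketch of the collective's effect, not the DeepSpeed API.

```python
import numpy as np

def ulysses_all_to_all(shards):
    """Simulated DeepSpeed-Ulysses all-to-all.

    shards: list of P arrays of shape (seq/P, H), i.e. each rank
    starts with all H heads for its sequence slice. After the
    exchange, each rank holds the FULL sequence for H/P heads,
    so standard attention can run locally on that head slice.
    """
    P = len(shards)
    H = shards[0].shape[1]
    assert H % P == 0, "head count must divide evenly across ranks"
    out = []
    for r in range(P):  # rank r gathers its head slice from every rank
        cols = slice(r * (H // P), (r + 1) * (H // P))
        out.append(np.concatenate([s[:, cols] for s in shards], axis=0))
    return out  # P arrays of shape (seq, H/P)

x = np.arange(8 * 4).reshape(8, 4)  # seq = 8, heads = 4
shards = [x[0:4], x[4:8]]           # P = 2 sequence shards
gathered = ulysses_all_to_all(shards)
```

Stacking the gathered head slices back along the head axis recovers the original activations exactly, which is what makes the exchange a pure re-layout with no recomputation.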
arXiv Detail & Related papers (2023-09-25T20:15:57Z) - Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both.
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
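The Jacobi variant of this idea is easy to demonstrate on a toy chain x_{t+1} = f_t(x_t): update all T states in parallel each iteration, and stop at a fixed point. After at most T sweeps the result provably equals the sequential evaluation, and it often converges sooner; the code below is a sketch of that framing, not the paper's implementation.

```python
def jacobi_feedforward(fs, x0, max_iters=None):
    """Evaluate the chain x_{t+1} = f_t(x_t) by Jacobi fixed-point
    iteration: every state is updated in parallel from the previous
    sweep's values."""
    T = len(fs)
    xs = [x0] * (T + 1)  # initial guess: every state is x0
    for it in range(max_iters or T):
        # one parallelizable sweep over all T transitions
        new = [x0] + [fs[t](xs[t]) for t in range(T)]
        if new == xs:     # fixed point reached: equals the sequential answer
            return new[-1], it
        xs = new
    return xs[-1], max_iters or T

# toy chain: f_t adds t, so the sequential answer is 0+1+2+3+4 = 10
fs = [lambda x, k=k: x + k for k in range(5)]
y, iters = jacobi_feedforward(fs, 0)
```

After k sweeps the first k+1 states are already exact, which is why the iteration can never need more than T sweeps, matching the paper's "reduced (or equal) number of parallelizable iterations" guarantee.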
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.