The Majority is not always right: RL training for solution aggregation
- URL: http://arxiv.org/abs/2509.06870v1
- Date: Mon, 08 Sep 2025 16:39:38 GMT
- Title: The Majority is not always right: RL training for solution aggregation
- Authors: Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, Ilia Kulikov
- Abstract summary: We train an aggregator model to review, reconcile, and synthesize a final, correct answer. A key ingredient is careful balancing of easy and hard training examples. We find our method, AggLM, outperforms both strong rule-based and reward-model baselines.
- Score: 53.1050856072799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up test-time compute, by generating multiple independent solutions and selecting or aggregating among them, has become a central paradigm for improving large language models (LLMs) on challenging reasoning tasks. While most prior work relies on simple majority voting or reward model ranking to aggregate solutions, these approaches may only yield limited benefits. In this work, we propose to learn aggregation as an explicit reasoning skill: given a set of candidate solutions, we train an aggregator model to review, reconcile, and synthesize a final, correct answer using reinforcement learning from verifiable rewards. A key ingredient is careful balancing of easy and hard training examples, allowing the model to learn both to recover minority-but-correct answers as well as easy majority-correct answers. Empirically, we find our method, AggLM, outperforms both strong rule-based and reward-model baselines, across multiple benchmarks. Furthermore, it generalizes effectively to solutions from differing models, including stronger ones than contained in the training data, all while requiring substantially fewer tokens than majority voting with larger numbers of solutions.
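The abstract contrasts majority voting with a learned aggregator and stresses balancing easy and hard training examples. The sketch below illustrates only those two ideas and is not the authors' implementation. It assumes, as a labeled guess, that an example counts as "hard" when the majority answer among its candidates is wrong and "easy" when it is right. The names `majority_vote`, `balance_examples`, and `easy_keep_fraction`, and the dictionary layout with `answers` and `gold` keys, are hypothetical.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the candidates."""
    return Counter(answers).most_common(1)[0][0]


def balance_examples(examples, easy_keep_fraction=0.5, seed=0):
    """Keep every hard example and a random fraction of the easy ones.

    ASSUMPTION: "hard" means the majority answer differs from the gold
    answer; "easy" means the majority answer already matches it.
    """
    rng = random.Random(seed)
    hard = [ex for ex in examples if majority_vote(ex["answers"]) != ex["gold"]]
    easy = [ex for ex in examples if majority_vote(ex["answers"]) == ex["gold"]]
    n_keep = round(len(easy) * easy_keep_fraction)
    return hard + rng.sample(easy, n_keep)


# Toy data: each example holds candidate final answers and the gold answer.
examples = [
    {"answers": ["4", "4", "5"], "gold": "4"},  # easy: majority is correct
    {"answers": ["7", "7", "7"], "gold": "7"},  # easy: majority is correct
    {"answers": ["2", "2", "3"], "gold": "3"},  # hard: correct answer is in the minority
]
balanced = balance_examples(examples, easy_keep_fraction=0.5)
```

In the third toy example majority voting returns "2" although the correct answer "3" appears among the candidates, which is the minority-but-correct case the abstract says the aggregator is trained to recover.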
Related papers
- Learning Generative Selection for Best-of-N [52.88943295436412]
We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. Our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models.
arXiv Detail & Related papers (2026-02-02T14:21:15Z) - GCPO: When Contrast Fails, Go Gold [6.596504114809683]
We introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model.
arXiv Detail & Related papers (2025-10-09T05:09:06Z) - What Can You Do When You Have Zero Rewards During RL? [3.0795668932789515]
Reinforcement learning (RL) with outcome-based rewards has proven effective for improving large language models (LLMs) on complex reasoning tasks. We study this scenario through the graph search task introduced in Bachmann et al. (2024) and evaluate recent methods that incorporate desirable components. We find that a simple data-centric intervention of adding easier samples to the training set enables the model to eventually solve the original hard task despite starting from zero reward.
arXiv Detail & Related papers (2025-10-04T23:10:38Z) - Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA [10.122669382758122]
We show that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear. We adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
arXiv Detail & Related papers (2025-09-30T08:34:16Z) - Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers [63.99316853136304]
Mirror-Critique is a framework that trains a verifier with informative critiques. We deploy a small instruction-tuned model to synthesize high-quality critique data. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution.
arXiv Detail & Related papers (2025-09-27T06:50:24Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Thinkless: LLM Learns When to Think [57.857534644932194]
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. We propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning. On several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%.
arXiv Detail & Related papers (2025-05-19T17:24:16Z) - Rationale-Aware Answer Verification by Pairwise Self-Evaluation [11.763229353978321]
We show that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.
arXiv Detail & Related papers (2024-10-07T08:53:00Z) - Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z) - A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.