Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
- URL: http://arxiv.org/abs/2509.15172v2
- Date: Tue, 30 Sep 2025 19:57:55 GMT
- Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
- Authors: Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani,
- Abstract summary: Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts.<n>We formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA)<n>MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision.
- Score: 22.305033366660187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. While inference-time methods can mitigate these inconsistencies, they fail to address the core problem: LMs struggle to reliably select reasoning pathways leading to consistent outcomes under exploratory sampling. To address this, we formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. These trajectories emerge from deliberative exchanges where agents ground reasoning in peer arguments, not just aggregation of independent attempts, creating richer consensus signals than single-round majority voting. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA). These findings, coupled with strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA), demonstrate robust self-alignment that more reliably unlocks latent reasoning potential of language models.
Related papers
- Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge [18.843205691780284]
We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree.<n>AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points.<n>It yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
arXiv Detail & Related papers (2026-02-10T02:24:53Z) - MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning [85.05204262206296]
Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high.<n>Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks.<n>We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation.
arXiv Detail & Related papers (2026-02-05T04:58:16Z) - ORCH: many analyses, one merge-a deterministic multi-agent orchestrator for discrete-choice reasoning with EMA-guided routing [0.6445605125467574]
ORCH is a framework for discrete-choice reasoning that orchestrates heterogeneous language models.<n>It uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training-free.<n>Experiments on MMLU, MMLU-Pro, and GSM8K show that ORCH consistently outperforms single-model baselines and a majority-vote ensemble.
arXiv Detail & Related papers (2026-02-02T08:27:58Z) - Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning [112.16686518063456]
We introduce textbfMulti-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time.<n>MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making.<n>Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines
arXiv Detail & Related papers (2026-01-14T17:57:43Z) - Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs [23.590034731179824]
We present Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples cognitive modes.<n>Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis.<n>Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches.
arXiv Detail & Related papers (2025-11-08T21:01:27Z) - Multi-Agent Debate for LLM Judges with Adaptive Stability Detection [46.67172123607961]
We propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses.<n>We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles.<n> Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
arXiv Detail & Related papers (2025-10-14T16:30:30Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach [0.15749416770494704]
We show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage.<n>CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency.<n>By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient.
arXiv Detail & Related papers (2025-09-09T14:57:15Z) - Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.<n>In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.<n>We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth [0.0]
We introduce a new approach in which several advanced large language models produce and answer intricate, doctoral-level probability problems.<n>Our investigation focuses on how agreement among diverse models can signal the reliability of their outputs.
arXiv Detail & Related papers (2025-02-28T06:20:52Z) - When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements [56.29265568399648]
We argue that disagreements prevent premature consensus and expand the explored solution space.<n>Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
arXiv Detail & Related papers (2025-02-21T02:24:43Z) - DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics [52.242449026151846]
Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs)<n>We propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence.
arXiv Detail & Related papers (2024-07-08T22:15:01Z) - ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [61.07130026622437]
Large Language Models (LLMs) still struggle with natural language reasoning tasks.
Motivated by the society of minds, we propose ReConcile.
A multi-model multi-agent framework designed as a round table conference among diverse LLM agents.
arXiv Detail & Related papers (2023-09-22T17:12:45Z) - On the Complexity of Multi-Agent Decision Making: From Learning in Games
to Partial Monitoring [105.13668993076801]
A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees.
We study this question in a general framework for interactive decision making with multiple agents.
We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making.
arXiv Detail & Related papers (2023-05-01T06:46:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.