Related papers: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment

Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment

URL: http://arxiv.org/abs/2509.15172v2
Date: Tue, 30 Sep 2025 19:57:55 GMT
Title: Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment
Authors: Ankur Samanta, Akshayaa Magesh, Youliang Yu, Runzhe Wu, Ayush Jain, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani,
Abstract summary: Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts.<n>We formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA)<n>MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision.
Score: 22.305033366660187
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. While inference-time methods can mitigate these inconsistencies, they fail to address the core problem: LMs struggle to reliably select reasoning pathways leading to consistent outcomes under exploratory sampling. To address this, we formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA), a reinforcement learning framework that post-trains models to favor reasoning trajectories aligned with their internal consensus using majority/minority outcomes from multi-agent debate. These trajectories emerge from deliberative exchanges where agents ground reasoning in peer arguments, not just aggregation of independent attempts, creating richer consensus signals than single-round majority voting. MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision, driving substantial improvements across self-consistency (+27.6% on GSM8K), single-agent reasoning (+23.7% on MATH), sampling-based inference (+22.4% Pass@20 on MATH), and multi-agent ensemble decision-making (+42.7% on MathQA). These findings, coupled with strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA), demonstrate robust self-alignment that more reliably unlocks latent reasoning potential of language models.

Related papers

Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge [18.843205691780284]
We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree.<n>AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points.<n>It yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
arXiv Detail & Related papers (2026-02-10T02:24:53Z)
MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning [85.05204262206296]
Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high.<n>Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks.<n>We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation.
arXiv Detail & Related papers (2026-02-05T04:58:16Z)
ORCH: many analyses, one merge-a deterministic multi-agent orchestrator for discrete-choice reasoning with EMA-guided routing [0.6445605125467574]
ORCH is a framework for discrete-choice reasoning that orchestrates heterogeneous language models.<n>It uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training-free.<n>Experiments on MMLU, MMLU-Pro, and GSM8K show that ORCH consistently outperforms single-model baselines and a majority-vote ensemble.
arXiv Detail & Related papers (2026-02-02T08:27:58Z)
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning [112.16686518063456]
We introduce textbfMulti-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time.<n>MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making.<n>Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines
arXiv Detail & Related papers (2026-01-14T17:57:43Z)
Maestro: Learning to Collaborate via Conditional Listwise Policy Optimization for Multi-Agent LLMs [23.590034731179824]
We present Through Role Orchestration (Maestro), a principled paradigm for collaboration that structurally decouples cognitive modes.<n>Maestro uses a collective of parallel Execution Agents for diverse exploration and a specialized Central Agent for convergent, evaluative synthesis.<n>Experiments on mathematical reasoning and general problem-solving benchmarks demonstrate that Maestro, coupled with CLPO, consistently outperforms existing state-of-the-art multi-agent approaches.
arXiv Detail & Related papers (2025-11-08T21:01:27Z)
Multi-Agent Debate for LLM Judges with Adaptive Stability Detection [46.67172123607961]
We propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses.<n>We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles.<n> Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
arXiv Detail & Related papers (2025-10-14T16:30:30Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach [0.15749416770494704]
We show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage.<n>CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency.<n>By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient.
arXiv Detail & Related papers (2025-09-09T14:57:15Z)
Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.<n>In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.<n>We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
Collective Reasoning Among LLMs: A Framework for Answer Validation Without Ground Truth [0.0]
We introduce a new approach in which several advanced large language models produce and answer intricate, doctoral-level probability problems.<n>Our investigation focuses on how agreement among diverse models can signal the reliability of their outputs.
arXiv Detail & Related papers (2025-02-28T06:20:52Z)
When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements [56.29265568399648]
We argue that disagreements prevent premature consensus and expand the explored solution space.<n>Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
arXiv Detail & Related papers (2025-02-21T02:24:43Z)
DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics [52.242449026151846]
Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs)<n>We propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence.
arXiv Detail & Related papers (2024-07-08T22:15:01Z)
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [61.07130026622437]
Large Language Models (LLMs) still struggle with natural language reasoning tasks. Motivated by the society of minds, we propose ReConcile. A multi-model multi-agent framework designed as a round table conference among diverse LLM agents.
arXiv Detail & Related papers (2023-09-22T17:12:45Z)
On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring [105.13668993076801]
A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees. We study this question in a general framework for interactive decision making with multiple agents. We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making.
arXiv Detail & Related papers (2023-05-01T06:46:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.