Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
- URL: http://arxiv.org/abs/2602.09341v1
- Date: Tue, 10 Feb 2026 02:24:53 GMT
- Title: Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
- Authors: Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, Jesse Thomason
- Abstract summary: We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points. It yields up to 5% absolute accuracy improvement over majority voting and up to 3% over LLM-as-Judge.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to the MAS configuration, and across five popular settings it yields up to 5% absolute accuracy improvement over majority voting and up to 3% over LLM-as-Judge.
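To make the audit mechanism concrete, here is a minimal Python sketch of the idea under stated assumptions: the `Node` structure, `build_tree`, `audit`, and the `judge` callback are all illustrative names, not the authors' implementation. Agent traces are merged step by step so that agreeing steps share a node and disagreements branch; the adjudicator is invoked only at branch points, which is what makes verification localized.

```python
# Minimal sketch of the reasoning-tree audit idea (illustrative only;
# names and structures here are assumptions, not the paper's code).
from dataclasses import dataclass, field


@dataclass
class Node:
    step: str                      # a reasoning step shared by the traces below
    traces: list[int]              # indices of agent traces passing through
    children: list["Node"] = field(default_factory=list)


def build_tree(traces: list[list[str]]) -> Node:
    """Merge agent traces step by step: agreeing prefixes share a node,
    disagreeing steps branch."""
    root = Node(step="<root>", traces=list(range(len(traces))))

    def grow(node: Node, depth: int) -> None:
        groups: dict[str, list[int]] = {}
        for i in node.traces:
            if depth < len(traces[i]):
                groups.setdefault(traces[i][depth], []).append(i)
        for step, members in groups.items():
            child = Node(step=step, traces=members)
            node.children.append(child)
            grow(child, depth + 1)

    grow(root, 0)
    return root


def audit(node: Node, judge) -> list[int]:
    """Walk the tree; at each divergence point, ask the judge to compare
    only the competing branches (localized verification), then descend."""
    while node.children:
        if len(node.children) == 1:
            node = node.children[0]          # consensus step: no check needed
        else:
            node = judge(node.children)      # adjudicate the conflict locally
    return node.traces                        # surviving traces give the answer


# Illustrative traces: two agents agree, one diverges at the first step.
traces = [["x=2", "ans=4"], ["x=2", "ans=4"], ["x=3", "ans=9"]]
majority_judge = lambda branches: max(branches, key=lambda b: len(b.traces))
print(audit(build_tree(traces), majority_judge))  # -> [0, 1]
```

Swapping in the largest-branch judge, as in the usage line above, recovers plain majority voting; the paper's contribution is an adjudicator (trained with ACPO) that can instead side with a well-evidenced minority branch.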
Related papers
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems [59.20800753428596]
We present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models). We find that process-level verification does not consistently improve performance and frequently exhibits high variance.
arXiv Detail & Related papers (2026-02-03T03:30:36Z)
- ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning [9.381086885165208]
Inference-time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers. We propose a novel method, Aligned Delegation for Multi-Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single-agent generation.
arXiv Detail & Related papers (2026-01-28T00:29:21Z)
- OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning [14.105640933123325]
Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. We propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures.
arXiv Detail & Related papers (2025-10-20T19:07:51Z)
- Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information [57.397381631496906]
We develop two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP). Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting, ARMMAN.
arXiv Detail & Related papers (2025-10-01T22:21:50Z)
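ISP's name nods to the classic "surprisingly popular" selection rule. As background only (this is that classic rule, not the paper's OW or ISP algorithms, and all names are illustrative), a minimal sketch shows how higher-order information, namely each agent's prediction of the other agents' answers, can overturn a wrong majority:

```python
# Classic "surprisingly popular" selection (background for OW/ISP; NOT the
# paper's algorithm). Each agent reports an answer plus a prediction of how
# often others will pick each answer.
from collections import Counter


def surprisingly_popular(answers: list[str],
                         predictions: list[dict[str, float]]) -> str:
    n = len(answers)
    actual = {a: c / n for a, c in Counter(answers).items()}
    # Average predicted popularity of each answer across agents.
    predicted = {a: sum(p.get(a, 0.0) for p in predictions) / n for a in actual}
    # Pick the answer whose actual support most exceeds its predicted support.
    return max(actual, key=lambda a: actual[a] - predicted[a])


# Majority says "B", but "A" is more popular than the agents predicted:
ans = ["A", "B", "B"]
preds = [{"A": 0.1, "B": 0.9}, {"A": 0.1, "B": 0.9}, {"A": 0.1, "B": 0.9}]
print(surprisingly_popular(ans, preds))  # -> "A"
```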
- Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment [22.305033366660187]
Language Models (LMs) are inconsistent reasoners, often generating contradictory responses to identical prompts. We formalize self-consistency as an intrinsic property of well-aligned reasoning models and introduce Multi-Agent Consensus Alignment (MACA). MACA enables agents to teach themselves to be more decisive and concise, and better leverage peer insights in multi-agent settings without external supervision.
arXiv Detail & Related papers (2025-09-18T17:27:28Z)
- Byzantine-Robust Decentralized Coordination of LLM Agents [4.097563258332958]
We propose DecentLLMs, a novel decentralized consensus approach for multi-agent LLM systems. Agents generate answers concurrently, and evaluator agents independently score and rank these answers to select the best available one. Experimental results demonstrate that DecentLLMs effectively tolerates Byzantine agents and significantly improves the quality of selected answers.
arXiv Detail & Related papers (2025-07-20T11:55:26Z)
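As a generic illustration of what Byzantine tolerance means for answer selection (a standard trimmed-mean sketch, not the DecentLLMs protocol; all names are assumptions), dropping the f most extreme scores per answer lets up to f malicious evaluators be outvoted:

```python
# Generic sketch of Byzantine-tolerant answer selection (illustrative, not
# the DecentLLMs protocol): with at most f malicious evaluators, drop the
# f highest and f lowest scores per answer before averaging.
def robust_select(scores: dict[str, list[float]], f: int) -> str:
    def trimmed_mean(xs: list[float]) -> float:
        xs = sorted(xs)
        kept = xs[f:len(xs) - f] if len(xs) > 2 * f else xs
        return sum(kept) / len(kept)
    return max(scores, key=lambda ans: trimmed_mean(scores[ans]))


# Five evaluators, one Byzantine (deflates a good answer, inflates a bad one):
scores = {
    "good": [0.9, 0.85, 0.9, 0.88, 0.0],   # last score is adversarial
    "bad":  [0.2, 0.25, 0.2, 0.3, 1.0],
}
print(robust_select(scores, f=1))  # -> "good"
```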
- Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z)
- When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements [56.29265568399648]
We argue that disagreements prevent premature consensus and expand the explored solution space. Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
arXiv Detail & Related papers (2025-02-21T02:24:43Z)
- ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs [61.07130026622437]
Large Language Models (LLMs) still struggle with natural language reasoning tasks. Motivated by the society of minds, we propose ReConcile, a multi-model multi-agent framework designed as a round-table conference among diverse LLM agents.
arXiv Detail & Related papers (2023-09-22T17:12:45Z)
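ReConcile closes each discussion round with a confidence-weighted vote over the agents' answers. A minimal sketch of such a voting step (illustrative names; the paper additionally adjusts the raw stated confidences, which this sketch omits):

```python
# Sketch of a confidence-weighted vote of the kind used to close each
# ReConcile round (illustrative; not the paper's full implementation).
from collections import defaultdict


def weighted_vote(answers: list[str], confidences: list[float]) -> str:
    tally: dict[str, float] = defaultdict(float)
    for ans, conf in zip(answers, confidences):
        tally[ans] += conf                 # each answer accumulates confidence
    return max(tally, key=tally.get)


# A single high-confidence dissenter can outweigh two lukewarm agreers:
print(weighted_vote(["A", "B", "B"], [0.95, 0.5, 0.4]))  # -> "A"
```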
- Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential decision-making framework where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to its own, possibly unknown, policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z)
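For context on pure exploration, here is a generic fixed-budget best-arm identification sketch (uniform round-robin sampling, not the paper's mediator-feedback algorithm; `pull` and all other names are illustrative). Under mediators' feedback, the learner would instead choose which mediator to query rather than the arm itself:

```python
# Generic pure-exploration sketch (not the paper's mediator algorithm):
# sample arms uniformly, then recommend the empirically best one.
import random


def best_arm(pull, n_arms: int, budget: int) -> int:
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        a = t % n_arms                 # round-robin over arms
        sums[a] += pull(a)             # observe a noisy reward
        counts[a] += 1
    return max(range(n_arms), key=lambda a: sums[a] / counts[a])


# Hypothetical arms with Gaussian rewards; arm 2 has the highest mean.
means = [0.2, 0.5, 0.8]
pull = lambda a: random.gauss(means[a], 0.1)
print(best_arm(pull, n_arms=3, budget=300))  # -> 2 (with high probability)
```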