Debating for Better Reasoning: An Unsupervised Multimodal Approach
- URL: http://arxiv.org/abs/2505.14627v1
- Date: Tue, 20 May 2025 17:18:17 GMT
- Title: Debating for Better Reasoning: An Unsupervised Multimodal Approach
- Authors: Ashutosh Adhikari, Mirella Lapata
- Abstract summary: We extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement.
- Score: 56.74157117060815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.
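The protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_debate`, `DebateResult`, `Expert`, and `Judge` are hypothetical, the experts and judge are plain callables standing in for model calls, and the sketch assumes each expert keeps defending its initial answer across rounds. It shows the two structural points the abstract makes: a debate is held only when the experts disagree, and the judge receives the question and the arguments but never the image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# an expert sees the image, the question, and the transcript so far,
# and returns (answer, argument); the judge sees only text.
Expert = Callable[[str, str, List[str]], Tuple[str, str]]
Judge = Callable[[str, List[str]], str]


@dataclass
class DebateResult:
    answer: str            # final answer
    debated: bool          # False when the experts agreed up front
    transcript: List[str]  # arguments shown to the judge


def run_debate(image: str, question: str, expert_a: Expert,
               expert_b: Expert, judge: Judge, rounds: int = 2) -> DebateResult:
    # Each expert first answers according to its own belief.
    answer_a, arg_a = expert_a(image, question, [])
    answer_b, arg_b = expert_b(image, question, [])

    # Agreement: no debate is needed, so no role-playing is forced.
    if answer_a == answer_b:
        return DebateResult(answer_a, False, [])

    transcript = [f"A ({answer_a}): {arg_a}", f"B ({answer_b}): {arg_b}"]

    # Further rounds: each expert rebuts while keeping its initial answer
    # (a simplifying assumption of this sketch).
    for _ in range(rounds - 1):
        _, arg_a = expert_a(image, question, transcript)
        transcript.append(f"A ({answer_a}): {arg_a}")
        _, arg_b = expert_b(image, question, transcript)
        transcript.append(f"B ({answer_b}): {arg_b}")

    # The "blind" judge is never given the image, only the text.
    return DebateResult(judge(question, transcript), True, transcript)
```

In practice the callables would wrap calls to two vision-language models and a text-only LLM. Keeping the image out of the `Judge` signature makes the blind-judge constraint explicit in the types.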
Related papers
- MentisOculi: Revealing the Limits of Reasoning with Mental Imagery [63.285794947638614]
We develop MentisOculi, a suite of multi-step reasoning problems amenable to visual solution. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning.
arXiv Detail & Related papers (2026-02-02T18:49:06Z)
- Latent Debate: A Surrogate Framework for Interpreting LLM Thinking [26.20998021856433]
We introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. We show that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucinations.
arXiv Detail & Related papers (2025-12-01T17:27:31Z)
- From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models [36.54062692717823]
Chain-of-Thought (CoT) reasoning has demonstrated significant efficacy in language models by enhancing reasoning transparency and output interpretability. This paper provides a systematic review centered on "Multimodal Chain-of-Thought" (MCoT).
arXiv Detail & Related papers (2025-11-17T01:22:37Z)
- MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion [73.99171322670772]
Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news. MMPersuade provides a unified framework for systematically studying multimodal persuasion dynamics in LVLMs.
arXiv Detail & Related papers (2025-10-26T17:39:21Z)
- Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate [2.3027211055417283]
We show that debate can lead to a decrease in accuracy over time. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning. These results highlight important failure modes in the exchange of reasons during multi-agent debate.
arXiv Detail & Related papers (2025-09-05T13:47:38Z)
- Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes? [14.41230051139575]
This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty. We also present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function.
arXiv Detail & Related papers (2025-06-03T13:44:14Z)
- Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought [83.89629325805505]
We introduce Argus to address limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention.
arXiv Detail & Related papers (2025-05-29T17:59:56Z)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z)
- Mind with Eyes: from Language Reasoning to Multimodal Reasoning [19.719640188412463]
Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of the recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning.
arXiv Detail & Related papers (2025-03-23T13:40:44Z)
- LATTE: Learning to Think with Vision Specialists [103.5952731807559]
We propose LATTE, a family of vision-language models that offload perception to state-of-the-art vision models. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information.
arXiv Detail & Related papers (2024-12-07T00:42:04Z)
- ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration [20.040543142468344]
ACC-Collab is an Actor-Critic based learning framework to produce a two-agent team specialized in collaboration. We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks.
arXiv Detail & Related papers (2024-10-30T19:09:02Z)
- Training Language Models to Win Debates with Self-Play Improves Judge Accuracy [8.13173791334223]
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play.
We find that language model based evaluators answer questions more accurately when judging models optimized to win debates.
arXiv Detail & Related papers (2024-09-25T05:28:33Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- Debating with More Persuasive LLMs Leads to More Truthful Answers [45.0343254517401]
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
arXiv Detail & Related papers (2024-02-09T21:05:01Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.