Debating for Better Reasoning: An Unsupervised Multimodal Approach
- URL: http://arxiv.org/abs/2505.14627v1
- Date: Tue, 20 May 2025 17:18:17 GMT
- Title: Debating for Better Reasoning: An Unsupervised Multimodal Approach
- Authors: Ashutosh Adhikari, Mirella Lapata
- Abstract summary: We extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement.
- Score: 56.74157117060815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two "sighted" expert vision-language models debate an answer, while a "blind" (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.
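The debate protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Expert`, `DebateResult`, and `run_debate` names, the callable signatures, the fixed number of rounds, and the choice to return the shared answer when the experts agree are all hypothetical. The only properties taken from the abstract are that each expert defends its own answer, that debate happens when the experts disagree, and that the judge never sees the image.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Expert:
    """A 'sighted' vision-language model (hypothetical interface)."""
    # (image, question) -> the answer this expert actually believes
    answer: Callable[[str, str], str]
    # (image, question, own_answer, transcript_so_far) -> argument text
    argue: Callable[[str, str, str, List[str]], str]


@dataclass
class DebateResult:
    answer: str
    debated: bool
    transcript: List[str] = field(default_factory=list)


# A 'blind' judge: receives the question, the candidate answers, and the
# arguments, and returns the index of the winning answer. No image is passed.
Judge = Callable[[str, List[str], List[str]], int]


def run_debate(image: str, question: str, experts: List[Expert],
               judge: Judge, rounds: int = 2) -> DebateResult:
    # Each expert commits to its own belief; no roles are assigned.
    answers = [e.answer(image, question) for e in experts]

    # Assumption: when the experts agree, skip the debate and keep the answer.
    if len(set(answers)) == 1:
        return DebateResult(answers[0], debated=False)

    transcript: List[str] = []
    for _ in range(rounds):  # the round count is an assumption
        for i, expert in enumerate(experts):
            argument = expert.argue(image, question, answers[i], list(transcript))
            transcript.append(f"Expert {i} for {answers[i]!r}: {argument}")

    # The judge adjudicates from text alone.
    winner = judge(question, answers, transcript)
    return DebateResult(answers[winner], debated=True, transcript=transcript)


if __name__ == "__main__":
    # Stub experts and judge standing in for real models.
    e0 = Expert(lambda img, q: "red", lambda img, q, a, t: "The bodywork is red.")
    e1 = Expert(lambda img, q: "blue", lambda img, q, a, t: "The paint looks blue.")
    result = run_debate("car.png", "What colour is the car?", [e0, e1],
                        judge=lambda q, answers, transcript: 0)
    print(result.answer, result.debated, len(result.transcript))
```

Keeping the image out of the judge's signature makes the "blind" constraint structural rather than a matter of prompt discipline, which is one plausible way to enforce the setup the abstract describes.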
Related papers
- Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes? [14.41230051139575]
This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty. We also present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function.
arXiv Detail & Related papers (2025-06-03T13:44:14Z)
- Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought [83.89629325805505]
We introduce Argus to address limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention.
arXiv Detail & Related papers (2025-05-29T17:59:56Z)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z)
- Mind with Eyes: from Language Reasoning to Multimodal Reasoning [19.719640188412463]
Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of the recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning.
arXiv Detail & Related papers (2025-03-23T13:40:44Z)
- LATTE: Learning to Think with Vision Specialists [103.5952731807559]
We propose LATTE, a family of vision-language models that offload perception to state-of-the-art vision models. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information.
arXiv Detail & Related papers (2024-12-07T00:42:04Z)
- ACC-Collab: An Actor-Critic Approach to Multi-Agent LLM Collaboration [20.040543142468344]
ACC-Collab is an Actor-Critic based learning framework to produce a two-agent team specialized in collaboration. We demonstrate that ACC-Collab outperforms SotA multi-agent techniques on a wide array of benchmarks.
arXiv Detail & Related papers (2024-10-30T19:09:02Z)
- Training Language Models to Win Debates with Self-Play Improves Judge Accuracy [8.13173791334223]
We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play.
We find that language model based evaluators answer questions more accurately when judging models optimized to win debates.
arXiv Detail & Related papers (2024-09-25T05:28:33Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- Debating with More Persuasive LLMs Leads to More Truthful Answers [45.0343254517401]
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively.
Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
arXiv Detail & Related papers (2024-02-09T21:05:01Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.