Related papers: MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

URL: http://arxiv.org/abs/2502.13383v1
Date: Wed, 19 Feb 2025 02:46:52 GMT
Title: MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
Authors: Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang
Abstract summary: We introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. Our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista.
Score: 20.071520400080022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: According to the Test-Time Scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.
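
Illustration: the abstract describes two verifier-driven steps, rejection sampling to filter synthesized CoT data and answer selection over 12 rollouts. The following is a minimal sketch of both ideas under stated assumptions, not the authors' implementation. The names best_of_n, rejection_sample, toy_reasoner, and toy_verifier are hypothetical, and the toy functions stand in for real multimodal models.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: a reasoner maps (question, rollout index) to a
# chain-of-thought response; a verifier maps (question, response) to a score.
Reasoner = Callable[[str, int], str]
Verifier = Callable[[str, str], float]


def best_of_n(question: str, reasoner: Reasoner, verifier: Verifier,
              n: int = 12) -> Tuple[str, float]:
    """Sample n rollouts and return the one the verifier scores highest."""
    candidates = [reasoner(question, i) for i in range(n)]
    scored = [(verifier(question, c), c) for c in candidates]
    # max() keeps the first candidate among ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def rejection_sample(question: str, reasoner: Reasoner, verifier: Verifier,
                     n: int = 12, threshold: float = 0.5) -> List[str]:
    """Keep only rollouts whose verifier score reaches the threshold."""
    candidates = [reasoner(question, i) for i in range(n)]
    return [c for c in candidates if verifier(question, c) >= threshold]


# Toy stand-ins (assumptions for illustration only).
def toy_reasoner(question: str, rollout: int) -> str:
    # Cycles deterministically through the answers 5, 6, 7.
    return f"2 + 4 = {5 + rollout % 3}"


def toy_verifier(question: str, response: str) -> float:
    # Scores 1.0 only when the response ends with the correct result.
    return 1.0 if response.endswith("= 6") else 0.0


if __name__ == "__main__":
    answer, score = best_of_n("What is 2 + 4?", toy_reasoner, toy_verifier)
    print(answer, score)
    kept = rejection_sample("What is 2 + 4?", toy_reasoner, toy_verifier)
    print(len(kept), "rollouts kept for training data")
```

In this sketch the same verifier serves both purposes: at data-synthesis time it filters rollouts, and at inference time it ranks them. Whether the paper shares one scoring function across both stages is not stated in the abstract.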

Related papers

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning [4.963955559863751]
MMAT-1M is the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine. By fine-tuning open-source multimodal models on MMAT-1M, we observe significant performance gains.
arXiv Detail & Related papers (2025-07-29T15:39:14Z)
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models [4.451479907610764]
This paper introduces MMRefine, a benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement.
arXiv Detail & Related papers (2025-06-05T07:11:36Z)
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision [27.571090189791303]
We propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. We generate over 700k step-level annotations without human labeling.
arXiv Detail & Related papers (2025-05-19T17:55:08Z)
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM).
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [36.1723136776532]
Multi-Agent Verification (MAV) is a test-time compute paradigm that combines multiple verifiers to improve performance. We introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
arXiv Detail & Related papers (2025-02-27T18:53:30Z)
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [25.308196207219613]
Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs). In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-01-08T18:49:41Z)
Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark [77.93283927871758]
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T15:31:26Z)
Synthetic Multimodal Question Generation [60.33494376081317]
Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents. We propose SMMQG, a synthetic data generation framework that generates question and answer pairs directly from multimodal documents. We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it.
arXiv Detail & Related papers (2024-07-02T12:57:42Z)
UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.
arXiv Detail & Related papers (2024-05-16T17:58:45Z)
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms. In this study, we investigate and identify the presence of unimodal bias in widely-used multimodal misinformation detection (MMD) benchmarks. We introduce a new method, termed Crossmodal HArd Synthetic MisAlignment (CHASMA), for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.