Related papers: MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

URL: http://arxiv.org/abs/2505.13427v1
Date: Mon, 19 May 2025 17:55:08 GMT
Title: MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
Authors: Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao,
Abstract summary: We propose MM-PRM, a process reward model trained within a fully automated, scalable framework.<n>We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data.<n>We generate over 700k step-level annotations without human labeling.
Score: 27.571090189791303
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at https://github.com/ModalMinds/MM-PRM.

Related papers

GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning [12.724393910603299]
We introduce the Generative Multimodal Process Reward Model (GM-PRM)<n>Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step.<n>We show that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks.
arXiv Detail & Related papers (2025-08-06T05:10:29Z)
DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning [33.574626079343936]
We introduce DreamPRM, a domain-reweighted training framework for multimodal PRMs.<n>In the lower-level optimization, DreamPRM performs fine-tuning on multiple datasets with domain weights.<n>In the upper-level optimization, the PRM is evaluated on a separate meta-learning dataset.
arXiv Detail & Related papers (2025-05-26T17:20:17Z)
Process Reward Models That Think [86.88809596842428]
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling.<n>This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT)<n>We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs.
arXiv Detail & Related papers (2025-04-23T15:44:54Z)
MM-IFEngine: Towards Multimodal Instruction Following [85.90027280653925]
We present MM-IFEngine, a pipeline to generate high-quality image-instruction pairs.<n> MM-IFInstruct-23k is suitable forSupervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO)<n>We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks.
arXiv Detail & Related papers (2025-04-10T17:59:12Z)
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness [61.87055159919641]
Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities.<n>Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality.<n>We introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM)
arXiv Detail & Related papers (2025-03-24T08:46:52Z)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs)<n>Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks.<n>For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning [55.82649731348012]
We introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters.<n>The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes.<n>The latter is a multimodal model employing rule-based reinforcement learning utilizing online filtering and two-stage training strategy to enhance training stability.
arXiv Detail & Related papers (2025-03-10T14:23:12Z)
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification [20.071520400080022]
We introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification.<n>Our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista.
arXiv Detail & Related papers (2025-02-19T02:46:52Z)
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [25.308196207219613]
Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs)<n>In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-01-08T18:49:41Z)
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.<n>Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.<n>We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.