VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
- URL: http://arxiv.org/abs/2503.10291v1
- Date: Thu, 13 Mar 2025 12:03:37 GMT
- Title: VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
- Authors: Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
- Abstract summary: We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
- Score: 76.35753243272521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
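The Best-of-N (BoN) evaluation mentioned in the abstract can be illustrated with a minimal sketch: sample N candidate responses, score each reasoning step with a process reward model, aggregate the step scores, and keep the top-scoring response. Everything named below is hypothetical and is not the released VisualPRM API: `toy_scorer` is a stand-in for a real PRM, and averaging the step scores is an assumed aggregation rule.

```python
from statistics import mean
from typing import Callable, List, Sequence

# A step scorer receives the reasoning prefix up to and including the
# current step and returns a correctness score in [0, 1].
StepScorer = Callable[[Sequence[str]], float]


def score_response(steps: Sequence[str], step_scorer: StepScorer) -> float:
    """Aggregate per-step scores into one response score (assumed: mean)."""
    if not steps:
        return 0.0
    return mean(step_scorer(steps[: i + 1]) for i in range(len(steps)))


def best_of_n(candidates: Sequence[Sequence[str]], step_scorer: StepScorer) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    scores: List[float] = [score_response(c, step_scorer) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_scorer(prefix: Sequence[str]) -> float:
    """Hypothetical stand-in for a PRM: distrusts steps that mention guessing."""
    return 0.1 if "guess" in prefix[-1] else 0.9


candidates = [
    ["read the chart", "guess the value"],
    ["read the chart", "sum the two bars"],
]
best = best_of_n(candidates, toy_scorer)
print("selected candidate:", best)
```

In practice the scorer would be a call into a trained multimodal PRM that also receives the image and question; other aggregation rules, such as the minimum step score, are equally easy to swap in.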
Related papers
- GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning [35.429904556288996]
We introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification.
Experimental results show that GenPRM significantly outperforms prior PRMs with only 23K training data from MATH dataset.
arXiv Detail & Related papers (2025-04-01T15:21:05Z)
- ViLBench: A Suite for Vision-Language Process Reward Modeling [25.565912785217822]
This paper first benchmarks current vision large language models (VLLMs) as two types of reward models.
We introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals.
We preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models.
arXiv Detail & Related papers (2025-03-26T06:38:31Z)
- Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
We propose Similar, a step-wise Multi-dimensional Generalist Reward Model.
It offers fine-grained signals for agent training and can choose better actions for inference-time scaling.
We introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation.
arXiv Detail & Related papers (2025-03-24T13:30:47Z)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models.
Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models.
We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0.
InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet.
We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.
Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.
We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.
We integrate the HumanEval benchmark into our verification agent to assess the generated code's performance, providing insights into their respective capabilities and efficiencies.
arXiv Detail & Related papers (2024-04-01T10:06:04Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.