ViLBench: A Suite for Vision-Language Process Reward Modeling
- URL: http://arxiv.org/abs/2503.20271v1
- Date: Wed, 26 Mar 2025 06:38:31 GMT
- Title: ViLBench: A Suite for Vision-Language Process Reward Modeling
- Authors: Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
- Abstract summary: This paper first benchmarks current vision large language models (VLLMs) as two types of reward models. We introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. We preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models.
- Score: 25.565912785217822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Process-supervised reward models (PRMs) serve as fine-grained functions that provide detailed step-wise feedback on model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models, output reward models (ORMs) and process reward models (PRMs), on multiple vision-language benchmarks. The results reveal that neither ORMs nor PRMs consistently outperform the other across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's difficulty for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward examples with an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT, and up to 2.5% over its untrained counterpart on ViLBench, when selecting among OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
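To make the ORM/PRM contrast concrete, here is a minimal, hypothetical sketch of best-of-N selection (all names are illustrative, and min-aggregation of step scores is one common choice, not necessarily the paper's): an ORM assigns one scalar to a whole response, while a PRM scores every reasoning step.

```python
from typing import Callable, List

def orm_select(candidates: List[str], orm_score: Callable[[str], float]) -> str:
    """Outcome reward model: one scalar per complete response."""
    return max(candidates, key=orm_score)

def prm_select(
    candidates: List[List[str]],                    # each candidate = list of reasoning steps
    prm_score: Callable[[List[str], int], float],   # score of step i given its prefix
) -> List[str]:
    """Process reward model: score every step, aggregate, pick the best chain."""
    def trajectory_score(steps: List[str]) -> float:
        # min() makes one bad step sink the whole trajectory; mean() is softer.
        return min(prm_score(steps, i) for i in range(len(steps)))
    return max(candidates, key=trajectory_score)
```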
Related papers
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models. We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
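A benchmark of this shape admits a one-line metric: a reward model is correct on a triplet when it scores the chosen response above the rejected one. A minimal sketch (field and function names are assumptions, not the benchmark's API):

```python
from typing import Callable, Dict, List

def pairwise_accuracy(
    triplets: List[Dict[str, str]],          # keys: "prompt", "chosen", "rejected"
    reward: Callable[[str, str], float],     # reward(prompt, response) -> scalar
) -> float:
    """Fraction of triplets where the chosen response outscores the rejected one."""
    correct = sum(
        reward(t["prompt"], t["chosen"]) > reward(t["prompt"], t["rejected"])
        for t in triplets
    )
    return correct / len(triplets)
```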
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- Gamified crowd-sourcing of high-quality data for visual fine-tuning [0.9487395978583629]
This paper introduces Gamified Adversarial Prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models.
GAP transforms the data collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge.
arXiv Detail & Related papers (2024-10-05T05:10:29Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
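One simple way to write down such an optimism bonus (an illustrative sketch only, not SELM's exact bilevel objective) is to start from the DPO loss and subtract a term that rewards the policy for raising its log-ratio on chosen responses, nudging it toward potentially high-reward regions beyond the reference model:

```python
import torch
import torch.nn.functional as F

def optimistic_dpo_loss(
    logp_w: torch.Tensor,       # log pi_theta(chosen | x), shape (B,)
    logp_l: torch.Tensor,       # log pi_theta(rejected | x)
    ref_logp_w: torch.Tensor,   # same quantities under the frozen reference policy
    ref_logp_l: torch.Tensor,
    beta: float = 0.1,
    alpha: float = 1e-3,        # strength of the (hypothetical) optimism bonus
) -> torch.Tensor:
    ratio_w = logp_w - ref_logp_w
    ratio_l = logp_l - ref_logp_l
    dpo = -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
    # Optimism: also push up the chosen response's log-ratio itself,
    # biasing exploration toward responses the policy deems high-reward.
    return dpo - alpha * (beta * ratio_w).mean()
```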
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
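A hypothetical sketch of that reward decomposition (data structures and thresholds are assumptions, not the authors' code): sibling nodes in the search tree share a reasoning prefix, so comparing their backed-up values yields step-level preference pairs that DPO can consume.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    step: str                       # reasoning step text at this node
    visits: int = 0
    value: float = 0.0              # mean return backed up by MCTS
    children: List["Node"] = field(default_factory=list)

def collect_step_preferences(root: Node, min_visits: int = 4) -> List[Tuple[str, str]]:
    """Extract (preferred_step, dispreferred_step) pairs from an MCTS tree."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        siblings = [c for c in node.children if c.visits >= min_visits]
        if len(siblings) >= 2:
            best = max(siblings, key=lambda c: c.value)
            worst = min(siblings, key=lambda c: c.value)
            if best.value > worst.value:   # only keep pairs with a real margin
                pairs.append((best.step, worst.step))
        stack.extend(node.children)
    return pairs
```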
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code based on a common description from different advanced language models.
We integrate the HumanEval benchmark into our verification agent to assess the generated code's performance, providing insights into their respective capabilities and efficiencies.
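The core of such a verification agent can be approximated in a few lines (a simplified sketch, not the paper's agent): HumanEval ships each problem with assert-based unit tests, so functional correctness reduces to executing the candidate against them.

```python
from typing import Dict

def verify_candidate(code: str, tests: str) -> bool:
    """Run generated code against HumanEval-style assert-based tests.

    WARNING: exec() on model output is unsafe outside a sandbox.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)    # define the candidate function
        exec(tests, namespace)   # failing asserts raise AssertionError
        return True
    except Exception:
        return False

# e.g. pass_rate = sum(verify_candidate(c, t) for c, t in samples) / len(samples)
```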
arXiv Detail & Related papers (2024-04-01T10:06:04Z)
- ZhichunRoad at Amazon KDD Cup 2022: MultiTask Pre-Training for E-Commerce Product Search [4.220439000486713]
We propose a robust multilingual model to improve the quality of search results.
In the pre-training stage, we adopt masked language modeling (MLM), classification, and contrastive learning tasks.
In the fine-tuning stage, we use confident learning, the exponential moving average method (EMA), adversarial training (FGM), and the regularized dropout strategy (R-Drop).
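Of these tricks, EMA is the most compact to illustrate (a standard formulation, assumed rather than taken from the paper): after every optimizer step a shadow copy of the weights is smoothed, and the smoothed weights are usually swapped in for evaluation.

```python
import torch

class EMA:
    """Exponential moving average of model weights:
    shadow <- decay * shadow + (1 - decay) * param, applied after each step."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for name, param in model.named_parameters():
            self.shadow[name].mul_(self.decay).add_(param.detach(), alpha=1 - self.decay)
```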
arXiv Detail & Related papers (2023-01-31T07:31:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.