Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
- URL: http://arxiv.org/abs/2502.14191v1
- Date: Thu, 20 Feb 2025 01:48:13 GMT
- Title: Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
- Authors: Michihiro Yasunaga, Luke Zettlemoyer, Marjan Ghazvininejad
- Abstract summary: We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models.
Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models.
We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
- Abstract: Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable alignment with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.
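To make the evaluation protocol concrete, below is a minimal sketch of how a judge's overall accuracy on such (prompt, chosen response, rejected response) triplets could be computed: the judge scores both responses, and a triplet counts as correct when the chosen response scores higher. The JSONL layout, the field names (`prompt`, `image`, `chosen`, `rejected`), and the `judge` callable are illustrative assumptions, not the benchmark's actual data format or API; the released repository defines the real interface.

```python
import json
from typing import Callable

def pairwise_accuracy(triplets_path: str,
                      judge: Callable[[str, str, str], float]) -> float:
    """Fraction of triplets where the judge scores the chosen response
    above the rejected one. `judge(prompt, image_path, response)` is any
    scalar scoring function, e.g. a wrapper around a VLM used as a judge."""
    correct = total = 0
    with open(triplets_path) as f:
        for line in f:
            ex = json.loads(line)  # assumed JSONL record with these field names
            chosen_score = judge(ex["prompt"], ex["image"], ex["chosen"])
            rejected_score = judge(ex["prompt"], ex["image"], ex["rejected"])
            correct += int(chosen_score > rejected_score)
            total += 1
    return correct / total if total else 0.0

# Example with a trivial stand-in judge (scores a response by its length):
# acc = pairwise_accuracy("triplets.jsonl", lambda p, img, r: float(len(r)))
```

Per-domain accuracy (e.g. reasoning or safety) would be computed the same way after filtering the triplets by domain.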
Related papers
- InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
We present IXC-2.5-Reward, a simple yet effective multi-modal reward model that aligns Large Vision Language Models with human preferences.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
arXiv Detail & Related papers (2025-01-21T18:47:32Z)
- AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
We introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks.
Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs.
The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z)
- DARE: Diverse Visual Question Answering with Robustness Evaluation
Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models.
However, they struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning.
We introduce DARE, Diverse Visual Question Answering with Robustness Evaluation.
arXiv Detail & Related papers (2024-09-26T16:31:50Z)
- MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
We introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges.
Specifically, we evaluate a large variety of multimodal judges, including smaller-sized CLIP-based scoring models, open-source VLMs, and closed-source VLMs.
Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming the other judges on average.
arXiv Detail & Related papers (2024-07-05T20:03:16Z)
- RewardBench: Evaluating Reward Models for Language Modeling
We present RewardBench, a benchmark dataset and codebase for evaluating reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- MMBench: Is Your Multi-modal Model an All-around Player?
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)