Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
- URL: http://arxiv.org/abs/2512.16899v1
- Date: Thu, 18 Dec 2025 18:56:04 GMT
- Title: Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
- Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
- Abstract summary: We introduce Multimodal RewardBench 2 (MMRB2), the first benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning. It provides 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks.
- Score: 58.14192385042352
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). We also show that MMRB2 performance correlates strongly with downstream task success under Best-of-N sampling, and we conduct an in-depth analysis that highlights key areas for improving reward models going forward.
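The headline numbers above come from a simple pairwise protocol: a judge scores both responses in each expert-annotated pair and is counted correct when it prefers the chosen one, and the same scorer drives the Best-of-N experiments. Below is a minimal sketch of that protocol, assuming a hypothetical `score(prompt, response)` callable standing in for a reward model or LLM-as-a-judge; it is not MMRB2's actual API.

```python
# Minimal sketch of pairwise reward-model evaluation and Best-of-N
# sampling. `score` is a hypothetical placeholder for any judge that
# maps (prompt, response) to a scalar reward; MMRB2's real interface
# may differ.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str      # task prompt (may reference interleaved images)
    chosen: str      # response preferred by the human experts
    rejected: str    # response the experts rejected

def judge_accuracy(pairs: List[PreferencePair],
                   score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the judge scores the chosen response higher."""
    correct = sum(score(p.prompt, p.chosen) > score(p.prompt, p.rejected)
                  for p in pairs)
    return correct / len(pairs)

def best_of_n(prompt: str, candidates: List[str],
              score: Callable[[str, str], float]) -> str:
    """Best-of-N sampling: keep the candidate the reward model ranks highest."""
    return max(candidates, key=lambda c: score(prompt, c))
```

Because `best_of_n` selects with the same scorer that `judge_accuracy` evaluates, a judge that ranks pairs well also tends to pick better candidates out of N, which is the correlation with downstream task success the paper reports.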
Related papers
- Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis [22.55861092515539]
A critical bottleneck remains the lack of effective reward models (RMs). We introduce Omni-RRM, the first open-source rubric-grounded reward model. It produces structured, multi-dimensional preference judgments with dimension-wise justifications across text, image, video, and audio.
arXiv Detail & Related papers (2026-01-31T18:20:45Z)
- Scheming Ability in LLM-to-LLM Strategic Interactions [4.873362301533824]
Large language model (LLM) agents are deployed autonomously in diverse contexts. We investigate the ability and propensity of frontier LLM agents to scheme through two game-theoretic frameworks. We test four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b).
arXiv Detail & Related papers (2025-10-11T04:42:29Z)
- MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation [81.26818054877658]
MMMG is a comprehensive benchmark for multimodal generation across 4 modality combinations. It is highly aligned with human evaluation, achieving an average agreement of 94.3%. GPT Image achieves 78.3% accuracy for image generation, but falls short on multimodal reasoning and interleaved generation.
arXiv Detail & Related papers (2025-05-23T08:21:28Z)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
LENS is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating MLLMs on image-invariant prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o, and two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, a process reward model that improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models. We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
- InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [80.93387166769679]
We present IXC-2.5-Reward, a simple yet effective multi-modal reward model that aligns Large Vision Language Models with human preferences. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
arXiv Detail & Related papers (2025-01-21T18:47:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.