RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
- URL: http://arxiv.org/abs/2505.16770v2
- Date: Fri, 23 May 2025 15:40:22 GMT
- Title: RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
- Authors: Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu
- Abstract summary: We present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%.
- Score: 40.96433915889983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv
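For a concrete sense of how accuracy on a benchmark like RBench-V is typically reported, the sketch below computes exact-match accuracy per category (math, physics, counting, games) and overall. The JSON layout, field names, and exact-match criterion are assumptions for illustration only; the paper's released data and evaluation code are at https://evalmodels.github.io/rbenchv

```python
# Hypothetical sketch: scoring a model's answers on an RBench-V-style benchmark.
# Field names ("id", "category", "answer") and the dataset layout are assumed,
# not taken from the official release.
import json
from collections import defaultdict

def score_predictions(dataset_path: str, predictions: dict[str, str]) -> dict[str, float]:
    """Return per-category and overall exact-match accuracy."""
    with open(dataset_path) as f:
        questions = json.load(f)  # assumed: list of {"id", "category", "answer", ...}

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        cat = q["category"]  # e.g. "math", "physics", "counting", "games"
        total[cat] += 1
        pred = predictions.get(q["id"], "").strip().lower()
        if pred == q["answer"].strip().lower():
            correct[cat] += 1

    report = {cat: correct[cat] / total[cat] for cat in total}
    report["overall"] = sum(correct.values()) / max(sum(total.values()), 1)
    return report
```

With a per-category breakdown like this, an overall score can be traced back to a model's performance on the math, physics, counting, and game subsets that the benchmark covers.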
Related papers
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering [8.21219588747224]
This paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a vision encoder with a sequence-to-sequence language model. VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-04-11T05:51:44Z)
- VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs). Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models. We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [60.08485416687596]
The Chain of Multi-modal Thought (CoMT) benchmark aims to mimic human-like reasoning that inherently integrates visual operations. We evaluate various LVLMs and strategies on CoMT, revealing key insights into the capabilities and limitations of current approaches.
arXiv Detail & Related papers (2024-12-17T14:10:16Z)
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which adds a new capability category called "image-text sequence understanding". Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions [27.983902791798965]
We develop a model that generates dilution text that maintains relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers on the two tasks drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
arXiv Detail & Related papers (2022-11-04T17:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.