Related papers: Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

URL: http://arxiv.org/abs/2603.02663v1
Date: Tue, 03 Mar 2026 06:51:08 GMT
Title: Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
Authors: Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi,
Abstract summary: Benchmarks for Multimodal Large Language Models should measure their ability for cross-modal integration.<n>Current benchmarks are filled with shortcut questions, which can be solved using only a single modality.<n>We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT.<n>M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets.
Score: 22.63245796446805
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, we can find the correct answer without either the image or the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

Related papers

UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark [72.37370242707432]
This paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset.<n>UniM contains 31K high-quality instances across 30 domains and 7 representative modalities.<n>We also introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
arXiv Detail & Related papers (2026-03-05T11:45:16Z)
MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing [41.77627136743721]
In practical deployments, workloads span lightweight OCR to complex multimodal reasoning.<n>routing is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation.<n>We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models.
arXiv Detail & Related papers (2026-01-25T12:44:14Z)
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions.<n>These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages.<n>Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models [45.15161506154318]
Infi-MMR is a framework to systematically unlock the reasoning potential of Multimodal Small Language Models.<n>The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities.<n>The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts.<n>The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning.
arXiv Detail & Related papers (2025-05-29T04:51:56Z)
Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images.<n>Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent.<n> Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z)
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models.<n>The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs.<n>We propose the Multimodal Inconsistency Reasoning benchmark to assess MLLMs' ability to detect and reason about semantic mismatches.<n>We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding.<n>EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality.<n>Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning [5.65203350495478]
We present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark to analyze the reasoning capabilities of MLLMs.<n>FCMR is categorized into three difficulty levels-Easy, Medium, and Hard-facilitating a step-by-step evaluation.<n>Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model achieving only 30.4% accuracy on the most challenging tier.
arXiv Detail & Related papers (2024-12-17T05:50:55Z)
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark [77.93283927871758]
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark.<n> MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T15:31:26Z)
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio and video) to a frozen Large Language Models (LLMs) Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs)
arXiv Detail & Related papers (2023-11-30T18:43:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.