MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
- URL: http://arxiv.org/abs/2506.04141v1
- Date: Wed, 04 Jun 2025 16:33:41 GMT
- Title: MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
- Authors: Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao,
- Abstract summary: We propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. Experiments reveal that current models still struggle with multimodal reasoning.
- Score: 22.10711693948861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frame(s) mentioned in the question (hereafter referred to as the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer from and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated, drawing on extensive real-world user understanding to align with common perceptions. (4) Confusability: carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Moreover, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT demanded by multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
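The benchmark's distractor design suggests multiple-choice tasks scored by plain accuracy. As a minimal sketch of such scoring (not the authors' evaluation code; the JSON field names and file layout are assumptions), computing a model's accuracy over the tasks might look like this:

```python
# Minimal sketch of multiple-choice accuracy scoring of the kind implied by the
# reported numbers (e.g. o4-mini at 52.5% over 1,257 tasks). This is NOT the
# authors' evaluation code; the field names and file layout are hypothetical.
import json

def accuracy(results_path: str) -> float:
    """Fraction of tasks whose predicted option letter matches the gold answer."""
    with open(results_path) as f:
        records = json.load(f)  # assumed: list of {"answer": "A", "prediction": "A", ...}
    correct = sum(r["prediction"].strip().upper() == r["answer"].strip().upper()
                  for r in records)
    return correct / len(records)

if __name__ == "__main__":
    # e.g. one model's predictions over all MMR-V tasks
    print(f"accuracy = {accuracy('mmr_v_predictions.json'):.1%}")
```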
Related papers
- Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [29.811030252357195]
Multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework.
arXiv Detail & Related papers (2025-08-06T13:03:21Z) - MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z) - VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks [44.30048178589923]
We introduce two novel datasets designed to stimulate the model's advanced video understanding and reasoning abilities. We develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm.
arXiv Detail & Related papers (2025-06-10T03:57:53Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3,100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, on the QA task over our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [18.9270920369958]
Long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. Recent efforts have proposed benchmarks aimed at video reasoning, but tasks are often knowledge-driven and do not rely heavily on visual content. We introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning.
arXiv Detail & Related papers (2025-05-29T11:33:43Z) - Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? [56.06537213958482]
We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
arXiv Detail & Related papers (2025-05-27T16:05:01Z) - VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning [33.37714717781103]
VideoMind is a novel video-language agent designed for temporal-grounded video understanding. We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow. We propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors.
arXiv Detail & Related papers (2025-03-17T17:59:33Z) - Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z) - Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models [26.17300490736624]
Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs. We propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts.
arXiv Detail & Related papers (2025-02-22T01:52:37Z) - Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question. This task extends existing work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
arXiv Detail & Related papers (2024-11-15T03:45:09Z) - TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [28.883607056108605]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding.
TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks.
Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z) - TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)