Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
- URL: http://arxiv.org/abs/2505.21374v1
- Date: Tue, 27 May 2025 16:05:01 GMT
- Title: Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
- Authors: Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
- Abstract summary: We present Video-Holmes, a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information.
- Score: 56.06537213958482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in CoT reasoning and RL post-training have been reported to enhance the video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered based on explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes and designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released at https://github.com/TencentARC/Video-Holmes.
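Because results are reported as accuracy over annotated questions, evaluation presumably reduces to matching each model's chosen answer against the ground-truth option. The sketch below illustrates such a loop under assumptions: the JSON field names (`video`, `question`, `options`, `answer`) and the `ask_mllm` callable are illustrative stand-ins, not the released Video-Holmes schema; consult the GitHub repository for the official data format and evaluation script.

```python
import json

def evaluate(benchmark_path: str, ask_mllm) -> float:
    """Compute exact-match accuracy over a multiple-choice video QA file.

    `ask_mllm(video_path, question, options)` is assumed to return a single
    option letter such as "A". The JSON field names are illustrative only.
    """
    with open(benchmark_path) as f:
        items = json.load(f)

    correct = 0
    for item in items:
        prediction = ask_mllm(item["video"], item["question"], item["options"])
        # Count a hit only when the predicted letter matches the annotation.
        if prediction.strip().upper() == item["answer"].strip().upper():
            correct += 1
    return correct / len(items)

# Example usage (hypothetical model wrapper):
#   accuracy = evaluate("video_holmes_questions.json", my_model.answer)
#   print(f"accuracy = {accuracy:.1%}")  # ~45% for the best reported model
```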
Related papers
- Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [29.811030252357195]
Video reasoning with multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework.
arXiv Detail & Related papers (2025-08-06T13:03:21Z)
- GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? [76.67205289006795]
GLIMPSE consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over the full video context. Humans achieve 94.82% accuracy on GLIMPSE, but current LVLMs face significant challenges.
arXiv Detail & Related papers (2025-07-13T04:44:57Z) - ImplicitQA: Going beyond frames towards Implicit Video Reasoning [36.65883181090953]
ImplicitQA is a novel benchmark designed to test models on implicit reasoning.<n>It comprises 1K meticulously annotated QA pairs derived from 320+ high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z) - VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks [44.30048178589923]
We introduce two novel datasets designed to stimulate the model's advanced video understanding and reasoning abilities.<n>We develop VersaVid-R1, the first versatile video understanding and reasoning model under the Reason-Then-Respond paradigm.
arXiv Detail & Related papers (2025-06-10T03:57:53Z) - Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding [97.05584099530226]
We introduce MF$2$, a new benchmark for evaluating whether models can comprehend, consolidate, and recall key narrative information from full-length movies.<n>For each pair, models must correctly identify both the true and false claims.<n>Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance.
arXiv Detail & Related papers (2025-06-06T17:58:36Z) - MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos [22.10711693948861]
We propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos.<n>The benchmark is characterized by the following features.<n>Experiments reveal that current models still struggle with multi-modal reasoning.
arXiv Detail & Related papers (2025-06-04T16:33:41Z) - VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? [18.9270920369958]
Long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks.<n>Recent efforts have proposed benchmarks aimed at video reasoning, but tasks are often knowledge-driven and do not rely heavily on visual content.<n>We introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning.
arXiv Detail & Related papers (2025-05-29T11:33:43Z) - SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained-temporal understanding in videos remains a major challenge for current Video Large Multimodels (Video LMMs)<n>We contribute in three core aspects: dataset, model, and benchmark.<n>First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically to enable joint learning of video understanding, grounding, and multi-turn video chat.<n>Second, we propose the SAMA model, which incorporates a versatile-temporal context aggregator and a Segment Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z) - VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding [54.16233954353802]
We introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic videos generated by models like Veo2, Sora, and Kling.<n>We evaluate the critical thinking abilities of Multi-modal Large Language Models (MLLMs) on abnormalities that are perceptually obvious to humans but often hallucinated due to language priors.<n>We observe that these models perform well on many real-world benchmarks like MVBench and MovieChat, but still struggle with basic physics-based and commonsense reasoning in synthetic videos.
arXiv Detail & Related papers (2025-05-02T15:58:38Z) - MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models.<n>Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions.<n>We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z) - CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding [43.858197893052115]
CG-Bench is a novel benchmark for clue-grounded question answering in long videos.<n>It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories.<n>The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination.
arXiv Detail & Related papers (2024-12-16T18:46:45Z) - Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events [33.51522765443546]
BlackSwanSuite is a benchmark for evaluating vision-language models' ability to reason about unexpected events.<n>We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no questions, spanning 1,655 videos.<n>We find significant performance gaps of up to 32% from humans on these tasks.
arXiv Detail & Related papers (2024-12-07T19:19:03Z) - Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question.<n>This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions.<n>We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA)
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
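As a concrete illustration of one scoring protocol mentioned above, the MF$^2$ entry notes that a model is credited only when it labels both members of a true/false claim pair correctly. A minimal sketch of that paired rule follows; the dictionary layout and the `judge_claim` callable are hypothetical stand-ins, not the MF$^2$ release format.

```python
def paired_accuracy(pairs, judge_claim) -> float:
    """Score claim pairs: a pair counts only if both verdicts are correct.

    `judge_claim(movie_id, claim)` is assumed to return True when the model
    judges the claim to be true. The dictionary keys below are hypothetical.
    """
    credited = 0
    for pair in pairs:
        true_ok = judge_claim(pair["movie_id"], pair["true_claim"]) is True
        false_ok = judge_claim(pair["movie_id"], pair["false_claim"]) is False
        if true_ok and false_ok:
            credited += 1
    return credited / len(pairs)
```

This stricter pairing penalizes a model that answers "true" indiscriminately, which per-claim accuracy would partly reward.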