SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
- URL: http://arxiv.org/abs/2510.08559v1
- Date: Thu, 09 Oct 2025 17:59:23 GMT
- Title: SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
- Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang,
- Abstract summary: We introduce SciVideoBench, a rigorous benchmark designed to assess advanced video reasoning in scientific contexts.<n>SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos.<n>Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs.
- Score: 89.10286051587151
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios where perception/recognition is heavily relied on, while with relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for border science.
Related papers
- SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System [42.24867393406526]
SciEducator is a self-evolving multi-agent system for scientific video comprehension and education.<n>Our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism.<n>It can produce multimodal educational content tailored to specific scientific processes.
arXiv Detail & Related papers (2025-11-22T06:54:16Z) - MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs.<n>Our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos.<n>These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z) - MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues [38.63457491325088]
We introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues.<n>Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues.<n>These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring.
arXiv Detail & Related papers (2025-10-20T16:38:40Z) - Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models [78.40747899831793]
Video understanding represents the most challenging frontier in computer vision.<n>Recent emergence of Video-Large Multitemporal Models has demonstrated remarkable capabilities in video understanding tasks.<n>Survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities.
arXiv Detail & Related papers (2025-10-06T17:10:44Z) - Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning [39.6349428129868]
multimodal large language models (MLLMs) are crucial for downstream tasks like video question answering and temporal grounding.<n>We propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework.<n>With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning.
arXiv Detail & Related papers (2025-08-06T13:03:21Z) - HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs.<n>HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects.<n>A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z) - Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs [76.15356325947731]
We introduce Q-Bench-Video, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality.<n>We collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs.<n>Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance.
arXiv Detail & Related papers (2024-09-30T08:05:00Z) - How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.