Fugu-MT Paper Translation (Abstract): SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Paper overview: SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

arxiv url: http://arxiv.org/abs/2510.08559v1
Date: Thu, 09 Oct 2025 17:59:23 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.303777
Title: SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Title (reference translation): SciVideoBench: A Benchmark for Scientific Video Reasoning in Large Multimodal Models
Authors: Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang
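Abstract summary: SciVideoBench is a rigorous benchmark designed to assess advanced video reasoning in scientific contexts. It consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos. The evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs.
Reference score (independently computed attention score): 89.10286051587151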
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, so they saturate and fail to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.
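Abstract (reference translation): Large Multimodal Models (LMMs) have made remarkable progress across a range of capabilities, but complex video reasoning in the scientific domain remains an important and challenging frontier. Current video benchmarks mainly target general scenarios that depend heavily on perception and recognition and involve relatively simple reasoning tasks, so they saturate and cannot effectively evaluate advanced multimodal cognitive skills. To address this gap, we introduce SciVideoBench, a rigorous benchmark designed specifically to evaluate advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully constructed multiple-choice questions derived from cutting-edge scientific experimental videos spanning more than 25 specialized academic subjects, verified by a semi-automatic system. Each question requires sophisticated domain-specific knowledge, precise spatiotemporal perception, and complex logical reasoning, effectively challenging models' higher-order cognitive abilities. Our evaluation reveals significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for improvement in video reasoning capabilities. Detailed analyses of key factors such as reasoning complexity and visual grounding provide valuable insights and a clear direction for the future development of LMMs, encouraging the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench serves the interests of the community and helps push the boundary of cutting-edge AI for broader science.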
