Fugu-MT Paper Translation (Abstract): Video-Oasis: Rethinking Evaluation of Video Understanding

Paper summary: Video-Oasis: Rethinking Evaluation of Video Understanding

arxiv url: http://arxiv.org/abs/2603.29616v1
Date: Tue, 31 Mar 2026 11:37:42 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.578212
Title: Video-Oasis: Rethinking Evaluation of Video Understanding
Title (reference translation): Video-Oasis: Rethinking the Evaluation of Video Understanding
Authors: Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee, Taeoh Kim, Dongyoon Wee, Yukyung Choi
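Abstract summary: In video understanding, it is difficult to determine whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. Video-Oasis is a diagnostic suite that evaluates existing evaluations and distills spatio-temporal challenges for video understanding. The analysis finds that (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models perform barely above random guessing.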
Reference score (independently computed attention score): 20.076100437038313
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The inherent complexity of video understanding makes it difficult to attribute whether performance gains stem from visual perception, linguistic reasoning, or knowledge priors. While many benchmarks have emerged to assess high-level reasoning, the essential criteria that constitute video understanding remain largely overlooked. Instead of introducing yet another benchmark, we take a step back to re-examine the current landscape of video understanding. In this work, we provide Video-Oasis, a sustainable diagnostic suite designed to systematically evaluate existing evaluations and distill spatio-temporal challenges for video understanding. Our analysis reveals two critical findings: (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models exhibit performance barely exceeding random guessing. To bridge this gap, we investigate which algorithmic design choices contribute to robust video understanding, providing practical guidelines for future research. We hope our work serves as a standard guideline for benchmark construction and the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.
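Abstract (reference translation): The inherent complexity of video understanding makes it difficult to determine whether performance gains come from visual perception, linguistic reasoning, or knowledge priors. Many benchmarks have appeared to evaluate high-level reasoning, yet the essential criteria that constitute video understanding remain largely overlooked. Rather than introducing a new benchmark, we re-examine the current state of video understanding. In this work we provide Video-Oasis, a sustainable diagnostic suite that systematically evaluates existing evaluations and distills spatio-temporal challenges for video understanding. The analysis shows that (1) 54% of existing benchmark samples are solvable without visual input or temporal context, and (2) on the remaining samples, state-of-the-art models perform barely above random guessing. To bridge this gap, we examine which algorithmic design choices contribute to robust video understanding and provide practical guidelines for future research. We hope this work serves as a standard guideline for benchmark construction and for the rigorous evaluation of architecture development. Code is available at https://github.com/sejong-rcv/Video-Oasis.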

Related papers