Fugu-MT Paper Translation (Summary): EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

Paper summary: EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos

arxiv url: http://arxiv.org/abs/2603.29943v1
Date: Tue, 31 Mar 2026 16:16:17 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:03.844839
Title: EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos
Title (reference translation): EC-Bench: An Enumeration and Counting Benchmark for Ultra-Long Videos
Authors: Fumihiko Tsuchiya, Taiki Miyanishi, Mahiro Ukai, Nakamasa Inoue, Shuhei Kurita, Yusuke Iwasawa, Yutaka Matsuo
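Abstract summary: Real-world recordings often span tens of minutes or longer and contain sparse, diverse events. Most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer. This paper introduces EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos.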
Reference score (independently computed attention score): 56.23636449524238
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Counting in long videos remains a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, making long-range temporal reasoning particularly difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer, providing little insight into what should be counted or whether models consistently identify relevant instances across time. We introduce EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model achieves only 29.98% accuracy on Enumeration and 23.74% on Counting, while human performance reaches 78.57% and 82.97%, respectively. Our analysis reveals strong relationships between enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.
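Abstract (reference translation): Counting in long videos is a fundamental yet underexplored challenge in computer vision. Real-world recordings often span tens of minutes or longer and contain sparse, diverse events, which makes long-range temporal reasoning especially difficult. However, most existing video counting benchmarks focus on short clips and evaluate only the final numerical answer. This paper introduces EC-Bench, a benchmark that jointly evaluates enumeration, counting, and temporal evidence grounding in long-form videos. EC-Bench contains 152 videos longer than 30 minutes and 1,699 queries paired with explicit evidence spans. Across 22 multimodal large language models (MLLMs), the best model reaches 29.98% on Enumeration and 23.74% on Counting, whereas human performance is 78.57% and 82.97%, respectively. The analysis reveals strong relationships among enumeration accuracy, temporal grounding, and counting performance. These results highlight fundamental limitations of current MLLMs and establish EC-Bench as a challenging benchmark for long-form quantitative video reasoning.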
