Fugu-MT Paper Translation (Abstract): Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Paper overview: Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

arxiv url: http://arxiv.org/abs/2606.02522v1
Date: Mon, 01 Jun 2026 17:32:20 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.538129
Title: Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Authors: Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang
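Abstract summary: Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. This paper introduces Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding.
Reference score (attention score computed by this site): 52.031070006859544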
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
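The abstract's point that sparse frame sampling can skip a momentary event can be illustrated with a minimal sketch. It assumes uniform sampling that takes the centre frame of each equal-length segment; the abstract does not say which sampling strategies the evaluated models use, and the function names, video length, and event position below are all hypothetical.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spread frame indices, one from the centre of each segment.

    Assumed sampling scheme for illustration only; not taken from the paper.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], event_start: int, event_end: int) -> bool:
    """True if any sampled index falls inside the half-open range [event_start, event_end)."""
    return any(event_start <= i < event_end for i in indices)


# Hypothetical clip: 60 s at 30 fps (1800 frames) with a 5-frame event.
TOTAL_FRAMES = 1800
EVENT_START, EVENT_END = 1003, 1008

for budget in (16, 64, 450):
    sampled = uniform_sample_indices(TOTAL_FRAMES, budget)
    hit = event_is_sampled(sampled, EVENT_START, EVENT_END)
    print(f"{budget:4d} frames sampled -> event seen: {hit}")
```

In this hypothetical setup the event is missed at budgets of 16 and 64 frames and is only caught once the sampling stride drops below the event's duration. Catching the frame is also only the first step: the abstract notes that token compression and coarse temporal aggregation can still suppress the evidence afterwards.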
