Fugu-MT Paper Translation (Summary): SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

Paper summary: SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

arxiv url: http://arxiv.org/abs/2510.20622v1
Date: Thu, 23 Oct 2025 14:55:28 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:18.220996
Title: SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
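Title (reference translation): SeViCES: Integrating Semantic-Visual Evidence Consensus for Long Video Understanding
Authors: Yuan Sheng, Yanbin Hao, Chenxu Li, Shuo Wang, Xiangnan He
Abstract summary: This paper proposes a framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. Experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness.
Reference score (independently computed attention score): 36.30263540665245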
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
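Abstract (reference translation): Long video understanding remains difficult because the content is complex, diverse, and temporally scattered. Video large language models (Video-LLMs) can process videos lasting tens of minutes, but applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, but existing approaches typically ignore temporal dependencies or rely on unimodal evidence, which limits their ability to provide complete, query-relevant context. This paper proposes the SeViCES (Semantic-Visual Consensus Evidence Selection) framework. SeViCES is training-free and model-agnostic, and introduces two key components. The SVCFS (Semantic-Visual Consensus Frame Selection) module selects frames through (1) a temporal-aware semantic branch that uses LLM reasoning over captions and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, indicating the importance of consensus-driven evidence selection for Video-LLMs.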
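The consensus idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify how the two branches are combined, so the min-max normalization, the averaging of scores, and the helper names (`consensus_frames`, `refine_answer`, `reask`) are all assumptions made for illustration. The per-frame scores are assumed to be given; the caption-based LLM reasoning, clustering, and mutual-information alignment steps are not modeled here.

```python
from typing import Callable, List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores; ties go to the earlier frame."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def consensus_frames(semantic: Sequence[float],
                     visual: Sequence[float],
                     k: int) -> List[int]:
    """Pick k frame indices, preferring frames both branches rank in their top k.

    Hypothetical fusion rule (an assumption, not from the paper): frames in
    both top-k sets come first, the rest are ranked by the mean of the two
    normalized scores.
    """
    if len(semantic) != len(visual):
        raise ValueError("both branches must score the same frames")
    sem, vis = min_max(semantic), min_max(visual)
    agreed = set(top_k(sem, k)) & set(top_k(vis, k))
    fused = [(a + b) / 2 for a, b in zip(sem, vis)]
    order = sorted(range(len(fused)),
                   key=lambda i: (i not in agreed, -fused[i], i))
    return sorted(order[:k])  # return the selection in temporal order


def refine_answer(sem_answer: str,
                  vis_answer: str,
                  reask: Callable[[List[str]], str]) -> str:
    """Keep the answer if both branches agree; otherwise re-ask over the
    two candidates only (a stand-in for constraining the answer space)."""
    if sem_answer == vis_answer:
        return sem_answer
    return reask([sem_answer, vis_answer])


if __name__ == "__main__":
    semantic_scores = [0.1, 0.9, 0.2, 0.8, 0.3]  # made-up example values
    visual_scores = [0.2, 0.7, 0.9, 0.6, 0.1]    # made-up example values
    print(consensus_frames(semantic_scores, visual_scores, k=2))
```

In this sketch `reask` stands in for a second Video-LLM call that sees the fused evidence and may only choose among the disagreeing candidates; any callable that takes the candidate list and returns one answer can be passed in.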

Related papers