Fugu-MT Paper Translation (Abstract): Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Paper overview: Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

arxiv url: http://arxiv.org/abs/2605.26441v1
Date: Tue, 26 May 2026 01:54:17 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.566954
Title: Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
Title (reference translation): Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
Authors: Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu
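Abstract summary: This paper addresses the task of weakly-supervised video temporal grounding. The authors approach the task from a novel game perspective, effectively learning the uncertain relationship between each vision-language pair. Experiments show that the method achieves superior performance on both the Charades-STA and ActivityNet Caption datasets.
Reference score (independently computed attention score): 86.60594418472238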
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.
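Abstract (reference translation): This paper addresses the task of weakly-supervised video temporal grounding. Existing approaches are generally based on a moment proposal selection framework that uses contrastive learning and a reconstruction paradigm to score pre-defined moment proposals. Although they have made significant progress, the current frameworks overlook two essential issues. 1) Coarse-grained cross-modal learning: previous methods capture only the global video-level alignment with the query and fail to model the detailed consistency between video frames and query words that is needed to ground moment boundaries accurately. 2) Complex moment proposals: performance depends heavily on the quality of the proposals, and selecting them is time-consuming and complicated. The paper therefore makes a first attempt to tackle the task from a game perspective, effectively learning the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, each video frame and query word is modeled as a game player under multivariate cooperative game theory in order to learn its contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition through game-theoretic interaction, all uncertain but possible correspondences between frames and words can be valued. Finally, instead of moment proposals, the learned query-guided frame-wise scores are used for moment localization. Experiments show that the method achieves superior performance on both the Charades-STA and ActivityNet Caption datasets.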
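
Illustrative sketch (not from the paper): the abstract describes treating frames and query words as players in a cooperative game, valuing their contributions to a cross-modal similarity score, and localizing the moment from frame-wise scores. The paper's exact formulation is not given on this page, so the Python below only illustrates the general idea under loud assumptions: the classical Shapley value stands in for the contribution measure, the coalition value (sum over words of the best frame similarity) is a hypothetical choice, and `shapley_values` and `localize` are hypothetical helper names.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(sim):
    """Exact Shapley values for a toy frame-word game.

    sim[i, j] is an assumed similarity between frame i and word j.
    Players are ("f", i) for frames and ("w", j) for words.
    Exponential cost: only usable for a handful of players.
    """
    n_frames, n_words = sim.shape
    players = [("f", i) for i in range(n_frames)]
    players += [("w", j) for j in range(n_words)]
    n = len(players)

    def value(coalition):
        # Hypothetical characteristic function: each word in the coalition
        # is matched to its best frame in the coalition.
        frames = [i for kind, i in coalition if kind == "f"]
        words = [j for kind, j in coalition if kind == "w"]
        if not frames or not words:
            return 0.0
        return float(sim[np.ix_(frames, words)].max(axis=0).sum())

    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                total += weight * (value(subset + (p,)) - value(subset))
        phi[p] = total
    return phi


def localize(frame_scores, ratio=0.5):
    """Grow a span around the peak frame while scores stay >= ratio * peak.

    Returns inclusive (start, end) frame indices. This rule is an assumption
    made for illustration, not the paper's localization procedure.
    """
    scores = np.asarray(frame_scores, dtype=float)
    peak = int(scores.argmax())
    threshold = ratio * scores[peak]
    start = end = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end


if __name__ == "__main__":
    # Toy similarities: 5 frames x 2 words, with frames 2 and 3 matching best.
    sim = np.array([[0.1, 0.0], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.1, 0.2]])
    phi = shapley_values(sim)
    frame_scores = [phi[("f", i)] for i in range(sim.shape[0])]
    print(frame_scores, localize(frame_scores))
```

The Shapley values sum to the value of the full coalition (the efficiency property), which gives a quick sanity check on the implementation. Exact enumeration is exponential in the number of players, so any realistic use would need sampling or another approximation.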


Related papers