Fugu-MT Paper Translation (Abstract): CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

Paper overview: CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection

arxiv url: http://arxiv.org/abs/2606.01149v1
Date: Sun, 31 May 2026 10:36:49 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.281305
Title: CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection
Authors: Xin Dong, Wenjia Geng, Wenfeng Deng, Yansong Tang
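Abstract summary: Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames.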
Reference score (independently calculated attention score): 36.404472837216346
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model's ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.

Related papers