Fugu-MT Paper Translation (Abstract): Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

Paper overview: Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

arxiv url: http://arxiv.org/abs/2508.14039v1
Date: Tue, 19 Aug 2025 17:59:39 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:32.041742
Title: Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
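Title (reference translation): Beyond Simple Edits: Video Retrieval Incorporating Dense Modifications
Authors: Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Abstract summary: We propose a new dataset that captures fine-grained and composed actions across diverse video segments. Dense-WebVid-CoVR consists of 1.6 million samples, with about seven times more modification text than its existing counterpart. We developed a new model that integrates visual and textual information through Cross-Attention (CA) fusion.
Reference score (independently computed attention level): 96.46069692338645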
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Composed video retrieval is a challenging task that aims to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, which limits their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than that of its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state of the art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
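Abstract (reference translation): Composed video retrieval is a difficult task that seeks to retrieve a target video based on a query video and a text description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and the variations in temporal understanding that limit retrieval ability in the fine-grained setting. To address this problem, we propose a new dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, called Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text about seven times that of the existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model surpasses existing methods on all metrics and achieves state-of-the-art results. In particular, it achieves 71.3% Recall@1 in the visual+text setting, exceeding the state of the art by 3.4%, which underscores its effectiveness in leveraging detailed video descriptions and dense modification texts. The proposed dataset, code, and model are available at https://github.com/OmkarThawakar/BSE-CoVR.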
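
Note on the Recall@1 metric: the abstract reports results as Recall@1 but does not define it. The following is a minimal sketch of how Recall@K is commonly computed in retrieval, assuming a precomputed query-to-gallery similarity matrix and one ground-truth target per query. The function name `recall_at_k` and the toy scores are illustrative assumptions and are not taken from the paper or its repository.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                target_index: Sequence[int],
                k: int = 1) -> float:
    """Fraction of queries whose ground-truth target is among the top-k scores.

    similarity[i][j] is the (assumed precomputed) score between query i and
    gallery video j; target_index[i] is the gallery index of the correct video.
    """
    if not similarity:
        raise ValueError("need at least one query")
    hits = 0
    for scores, target in zip(similarity, target_index):
        # Rank gallery indices by descending score; sorted() is stable,
        # so tied scores keep the lower gallery index first.
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example with hypothetical scores: 3 queries, 4 gallery videos.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct video (index 0) is ranked 1st
    [0.2, 0.4, 0.8, 0.1],  # correct video (index 1) is ranked 2nd
    [0.1, 0.2, 0.3, 0.7],  # correct video (index 3) is ranked 1st
]
toy_targets = [0, 1, 3]

r1 = recall_at_k(toy_scores, toy_targets, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(toy_scores, toy_targets, k=2)  # all 3 queries hit
```

Under this definition, a Recall@1 of 71.3% would mean that the correct target video is the single highest-scoring candidate for 71.3% of the queries.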

Related papers