Fugu-MT Paper Translation (Abstract): ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

Paper overview: ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

arxiv url: http://arxiv.org/abs/2604.17898v1
Date: Mon, 20 Apr 2026 07:17:59 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.747338
Title: ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
Title (reference translation): ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
Authors: Zixu Li, Yupeng Hu, Zhiwei Chen, Qinlei Huang, Guozhi Qiu, Zhiheng Fu, Meng Liu
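Abstract summary: Composed Video Retrieval (CVR) has emerged as a new paradigm in video retrieval. Traditional composition methods tend to bias the composed feature toward the reference video. ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features.
Reference score (independently computed attention level): 24.278296673415827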
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes a multi-modal query consisting of a reference video and a piece of modification text as input. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there exists a substantial discrepancy in information density between video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant due to the presence of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRivEn dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack exhibits strong generalization to the Composed Image Retrieval (CIR) task, achieving SOTA performance across three benchmark datasets in both CVR and CIR scenarios. Codes are available at https://github.com/Lee-zixu/ReTrack
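Abstract (reference translation): With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a new paradigm in video retrieval and is attracting attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes as input a multi-modal query consisting of a reference video and a modification text. The modification text conveys the changes the user intends to make to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there is a considerable gap in information density between the video and text modalities. Conventional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation matters because of three main challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRivEn dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating the directional bias of composed features. It consists of three main modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality and calibrates the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence, which drives reliable composed-to-target similarity estimation. In addition, ReTrack generalizes strongly to the Composed Image Retrieval (CIR) task, achieving SOTA performance on three benchmark datasets across CVR and CIR scenarios. The code is available at https://github.com/Lee-zixu/ReTrack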
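
Illustrative sketch (not from the paper): the abstract states that composed features tend to lean toward the reference video. The toy Python example below shows what such a directional bias could look like geometrically and how it might affect retrieval. It is not ReTrack's method. The convex-combination `compose` function, the `text_weight` parameter, the `directional_bias` measure, and all feature values are assumptions made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def compose(video_feat, text_feat, text_weight=0.5):
    """Hypothetical composition: convex mix of unit-normalized features."""
    v, t = normalize(video_feat), normalize(text_feat)
    mixed = [(1 - text_weight) * x + text_weight * y for x, y in zip(v, t)]
    return normalize(mixed)


def directional_bias(composed, video_feat, text_feat):
    """Assumed measure: positive means the query leans toward the video."""
    return cosine(composed, video_feat) - cosine(composed, text_feat)


def retrieve(composed, gallery):
    """Return the gallery key most similar to the composed query."""
    return max(gallery, key=lambda name: cosine(composed, gallery[name]))


# Toy features, invented for this example only.
reference_video = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "near_reference": [0.9, 0.1, 0.0],
    "intended_target": [0.5, 0.8, 0.0],
}

# A video-dominated composition versus one that gives the text more weight.
biased = compose(reference_video, modification_text, text_weight=0.2)
balanced = compose(reference_video, modification_text, text_weight=0.6)

for name, query in [("biased", biased), ("balanced", balanced)]:
    bias = directional_bias(query, reference_video, modification_text)
    print(name, round(bias, 3), retrieve(query, gallery))
```

In this toy setup the video-dominated query retrieves the gallery item closest to the reference video, while the query that weights the text more heavily retrieves the item reflecting the modification. The paper's own calibration modules are described only at a high level in the abstract and are not reproduced here.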

Related papers