Fugu-MT paper translation (abstract): CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Paper overview: CAST: Modeling Visual State Transitions for Consistent Video Retrieval

arxiv url: http://arxiv.org/abs/2603.08648v1
Date: Mon, 09 Mar 2026 17:26:26 GMT
Status: translation complete
System update date: 2026-03-10 15:13:16.605894
Title: CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Title (reference translation): CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Authors: Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
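Abstract summary: The paper formalizes the task of Consistent Video Retrieval and introduces a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. CAST (Context-Aware State Transition) is a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces.
Reference score (independently computed attention score): 24.93764000962773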
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
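Abstract (reference translation): As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, common retrieval formulations prioritize local semantic alignment while ignoring state and identity consistency, and remain context-independent at inference time. To address this structural limitation, the authors formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. CAST (Context-Aware State Transition) is a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across a variety of foundation backbones. In addition, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.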
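
The abstract describes predicting a state-conditioned residual update ($Δ$) from visual history on top of frozen embeddings, but it gives no architectural details. The following is only a minimal sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the class name `ResidualStateAdapter`, the mean-pooled history summary, the single linear map `W`, and the random (untrained) weights and embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class ResidualStateAdapter:
    """Hypothetical adapter that predicts a residual update (delta) from visual history."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed parameterization: one linear map over [state ; query].
        # A real adapter would learn these weights; here they are random placeholders.
        self.W = rng.normal(scale=0.01, size=(2 * dim, dim))

    def delta(self, history, query):
        # Assumption: summarize the history as the mean of past clip embeddings.
        state = history.mean(axis=0)
        return np.concatenate([state, query]) @ self.W

    def adapt(self, history, query):
        # Residual update: frozen query embedding plus the predicted delta.
        return l2_normalize(query + self.delta(history, query))


def rank_candidates(adapted_query, candidates):
    """Return candidate indices sorted by descending cosine similarity, plus the scores."""
    scores = l2_normalize(candidates) @ adapted_query
    return np.argsort(-scores), scores


# Stand-in data: random vectors in place of frozen vision-language embeddings.
rng = np.random.default_rng(1)
dim = 16
history = l2_normalize(rng.normal(size=(3, dim)))     # embeddings of earlier clips
query = l2_normalize(rng.normal(size=dim))            # embedding of the current query
candidates = l2_normalize(rng.normal(size=(5, dim)))  # embeddings of candidate clips

adapter = ResidualStateAdapter(dim)
adapted = adapter.adapt(history, query)
order, scores = rank_candidates(adapted, candidates)
print(order)
```

Under the same assumptions, `rank_candidates` could also score embeddings of generated clips, which is one way to read the reranking use mentioned in the abstract. Because the backbone embeddings are left untouched and only the query side is shifted, the sketch stays compatible with any encoder that produces fixed-size vectors.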

Related papers