Fugu-MT Paper Translation (Summary): Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

arxiv url: http://arxiv.org/abs/2606.02441v1
Date: Mon, 01 Jun 2026 16:12:18 GMT
Status: Translation complete
System-internal update date: 2026-06-02 21:34:32.49244
Title: Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
Authors: Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma, Jiangning Zhang
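Abstract summary: Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while preserving a reference identity. The authors propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. With a lightweight design built on LTX-2.3, ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality.
Reference score (attention metric computed independently by the site): 79.94088803584262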
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
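Illustrative sketch (not from the paper): the abstract states that a three-stream reference classifier-free guidance strategy controls text adherence and reference fidelity independently, but it does not give the formula. The Python snippet below is a minimal sketch under one plausible assumption, namely a nested two-scale combination of three denoiser outputs: one with no conditioning, one with the reference only, and one with both reference and text. The function name, argument names, and the choice of which stream drops which condition are all hypothetical and may differ from the authors' method.

```python
import numpy as np


def three_stream_cfg(pred_uncond, pred_ref, pred_full, ref_scale, text_scale):
    """Combine three denoiser outputs using separate guidance scales.

    ASSUMPTION (not confirmed by the abstract): nested two-scale form.
      pred_uncond: prediction with neither text nor reference
      pred_ref:    prediction with the reference only
      pred_full:   prediction with both reference and text
    """
    # Direction that adds reference (identity) information.
    ref_direction = pred_ref - pred_uncond
    # Direction that adds text information on top of the reference.
    text_direction = pred_full - pred_ref
    return pred_uncond + ref_scale * ref_direction + text_scale * text_direction


# Toy tensors standing in for denoiser outputs on a small latent.
shape = (2, 4, 4)
pred_uncond = np.zeros(shape)
pred_ref = np.ones(shape)
pred_full = 3.0 * np.ones(shape)

guided = three_stream_cfg(pred_uncond, pred_ref, pred_full,
                          ref_scale=2.0, text_scale=1.5)
```

In this form, setting both scales to 1 recovers the fully conditioned prediction, and raising one scale strengthens only its own direction, which is one way the independent control described in the abstract could be realized.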
