Fugu-MT Paper Translation (Abstract): RefAlign: Representation Alignment for Reference-to-Video Generation

Paper abstract: RefAlign: Representation Alignment for Reference-to-Video Generation

arxiv url: http://arxiv.org/abs/2603.25743v1
Date: Thu, 26 Mar 2026 17:59:57 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.434928
Title: RefAlign: Representation Alignment for Reference-to-Video Generation
Title (reference translation): RefAlign: Representation Alignment for Reference-to-Video Generation
Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang
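Abstract summary: RefAlign is a representation alignment framework that aligns DiT reference-branch features to the semantic space of a visual foundation model. Experiments on the OpenS2V-Eval benchmark show that RefAlign outperforms state-of-the-art methods in TotalScore.
Reference score (independently computed attention score): 53.368296137314225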
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
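Abstract (reference translation): Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process with both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods usually introduce high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and feed them jointly into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and serve as implicit alignment signals, which can partially mitigate pixel-level information leakage in the VAE latent space. However, they may still have difficulty with copy-paste artifacts and with multi-subject confusion caused by modality mismatch across heterogeneous encoder features. This paper proposes RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer together to improve identity consistency, while pushing apart the corresponding features of different subjects to increase semantic discriminability. This simple and effective strategy is applied only during training, adds no inference-time overhead, and gives a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark show that RefAlign outperforms current state-of-the-art methods in TotalScore, confirming the effectiveness of explicit reference alignment for R2V tasks.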
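
The abstract describes the reference alignment loss only in words: same-subject reference and VFM features are pulled together, and different-subject features are pushed apart. One common way to express that idea is an InfoNCE-style contrastive loss over cosine similarities. The sketch below is an illustration under that assumption, not the formulation used in the paper; the function names, the `temperature` parameter, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    ref_feats[i] and vfm_feats[i] are taken to describe the same subject i.
    Every other pairing (i, j != i) is treated as a different-subject negative.
    """
    n = len(ref_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(ref_feats[i], vfm_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidate subjects.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching subject as the target.
        total += log_denom - logits[i]
    return total / n


# Toy example with two subjects and 2-D features (hypothetical values).
subject_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = reference_alignment_loss(subject_feats, subject_feats)
swapped = reference_alignment_loss(subject_feats, subject_feats[::-1])
print(aligned < swapped)
```

In this toy case the loss is lower when each reference feature matches the VFM feature of its own subject than when the two subjects are swapped, which is the behavior the abstract attributes to the alignment objective. Because such a term enters only the training objective, it is consistent with the abstract's statement that no inference-time overhead is incurred.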


Related papers