Fugu-MT Paper Translation (Summary): Identity-Consistent Video Generation under Large Facial-Angle Variations

Paper summary: Identity-Consistent Video Generation under Large Facial-Angle Variations

arxiv url: http://arxiv.org/abs/2603.21299v1
Date: Sun, 22 Mar 2026 15:54:16 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.341361
Title: Identity-Consistent Video Generation under Large Facial-Angle Variations
Title (reference translation): Identity-Consistent Video Generation When Facial-Angle Changes Are Large
Authors: Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang
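Abstract summary: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. We propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework. It significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches that use cross-paired data.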
Reference score (independently computed attention metric): 43.89758583859639
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the \textit{copy-paste} problem, particularly the \textbf{\textit{view-dependent copy-paste}} artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
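Abstract (reference translation): Single-view reference-to-video methods often struggle to maintain identity consistency under large facial-angle variations. This limitation naturally motivates incorporating multi-view facial references. However, simply adding reference images worsens the \textit{copy-paste} problem, in particular the \textbf{\textit{view-dependent copy-paste}} artifact, and reduces the naturalness of facial motion. Cross-paired data can mitigate this problem, but collecting such data is costly. To balance consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework. We propose a region-masking training strategy that prevents shortcut learning and extracts essential identity features by aggregating complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encodings to video and conditioning tokens to better model their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments show that the method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches that use cross-paired data.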
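
Illustrative note: the abstract says only that the reference decoupled-RoPE mechanism assigns distinct positional encodings to video and conditioning tokens, and gives no further detail. The sketch below is therefore not the authors' implementation. It shows standard 1-D rotary position embedding (RoPE) plus one hypothetical way to keep the two token groups in disjoint position ranges. The function names, the `ref_offset` parameter, and its value of 1000 are all assumptions made for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard 1-D RoPE to an even-length vector at an integer position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        # Pair (i, i+1) is rotated by an angle that grows with the position.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def decoupled_positions(num_video_tokens, num_ref_tokens, ref_offset=1000):
    """Hypothetical index scheme (an assumption, not taken from the paper):
    video tokens use 0..N-1, reference tokens use a disjoint range that
    starts at ref_offset."""
    video_pos = list(range(num_video_tokens))
    ref_pos = [ref_offset + j for j in range(num_ref_tokens)]
    return video_pos, ref_pos


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    video_pos, ref_pos = decoupled_positions(num_video_tokens=4, num_ref_tokens=2)
    query = [0.5, -1.0, 2.0, 0.25]
    key = [1.0, 0.5, -0.5, 1.5]
    # Logit between the first video token and the first reference token.
    logit = dot(rope_rotate(query, video_pos[0]), rope_rotate(key, ref_pos[0]))
    print(video_pos, ref_pos, logit)
```

Because RoPE attention logits depend only on the difference between positions, placing reference tokens in a separate index range means no reference token shares a relative offset with nearby video frames. Whether the paper uses an offset, a separate axis, or a different frequency base is not stated in the abstract.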

Related papers