Fugu-MT Paper Translation (Abstract): HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Paper abstract: HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

arxiv url: http://arxiv.org/abs/2606.10839v1
Date: Tue, 09 Jun 2026 13:26:39 GMT
Status: Translation complete
System update date: 2026-06-10 15:40:58.519212
Title: HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation
Title (reference translation): HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation
Authors: Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou, Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu
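Abstract summary: HarmoView is a robust framework for identity-consistent video generation. It integrates multi-view cues through three architectural refinements and a staged training curriculum. HarmoView significantly outperforms open-source baselines and matches leading closed-source engines.
Reference score (independently computed attention score): 21.0568663910476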
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually-curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.
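Abstract (reference translation): Current identity-consistent video generation methods struggle to maintain appearance fidelity under large viewpoint changes. Introducing multi-view reference inputs is a natural solution, but progress remains limited by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. To address these challenges, we propose HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. We then use learnable proxy tokens to unify heterogeneous reference layouts across single-view and multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy uses view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. In addition, we construct a large-scale multi-view dataset to address data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.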
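Illustrative note: the abstract names view dropout as part of the four-stage Progressive View Curriculum but gives no implementation details. The following is only a minimal sketch of what dropping reference views during training might look like. It is not the authors' method: the function name `apply_view_dropout`, the view names, the choice to always keep the frontal reference, and the per-stage probabilities in `STAGE_DROP_PROB` are all hypothetical.

```python
import random

# Hypothetical per-stage drop probabilities (NOT from the paper):
# early stages see mostly single-view input, later stages see more views.
STAGE_DROP_PROB = {1: 1.0, 2: 0.7, 3: 0.4, 4: 0.1}


def apply_view_dropout(views, drop_prob, rng, keep="frontal"):
    """Randomly drop reference views, always keeping the `keep` view.

    views:     dict mapping a view name to its reference image (any object)
    drop_prob: probability of dropping each view other than `keep`
    rng:       a random.Random instance, passed in for reproducibility
    """
    kept = {}
    for name, image in views.items():
        # rng.random() is in [0, 1), so drop_prob=1.0 drops every
        # non-frontal view and drop_prob=0.0 drops none.
        if name == keep or rng.random() >= drop_prob:
            kept[name] = image
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    refs = {"frontal": "front.png", "left": "left.png", "right": "right.png"}
    for stage, p in STAGE_DROP_PROB.items():
        print(stage, sorted(apply_view_dropout(refs, p, rng)))
```

Under these assumptions, lowering the drop probability from stage to stage exposes the model to multi-view references gradually, which is one plausible reading of the "stable transition" the abstract describes.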
