Fugu-MT paper translation (abstract): DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

Paper overview: DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

arxiv url: http://arxiv.org/abs/2601.01425v1
Date: Sun, 04 Jan 2026 08:07:11 GMT
Status: translation complete
Last updated in system: 2026-01-06 16:25:22.340481
Title: DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Title (reference translation): DreamID-V: Bridging the image-to-video gap for high-fidelity face swapping with a Diffusion Transformer
Authors: Xu Guo, Fulong Ye, Xinghui Li, Pengqi Tu, Pengze Zhang, Qichao Sun, Songtao Zhao, Xiangwang Hou, Qian He
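Abstract summary: Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. This paper proposes a comprehensive framework for seamlessly transferring the superiority of image face swapping to the video domain.
Reference score (independently computed attention level): 21.788582116033684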
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address the challenge, we propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline SyncID-Pipe that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon paired data, we propose the first Diffusion Transformer-based framework DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-model conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate DreamID-V outperforms state-of-the-art methods and further exhibits exceptional versatility, which can be seamlessly adapted to various swap-related tasks.
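Abstract (reference translation): Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while carefully preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address this challenge, we propose a comprehensive framework that seamlessly transfers the superiority of Image Face Swapping (IFS) to the video domain. We first introduce SyncID-Pipe, a new data pipeline that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. We propose DreamID-V, the first framework based on a Diffusion Transformer. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark covering diverse scenes. Extensive experiments show that DreamID-V outperforms state-of-the-art methods and also exhibits excellent versatility, adapting seamlessly to various swap-related tasks.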

Related papers