DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
- URL: http://arxiv.org/abs/2504.14509v3
- Date: Fri, 25 Apr 2025 03:48:24 GMT
- Title: DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning
- Authors: Fulong Ye, Miao Hua, Pengze Zhang, Xinghui Li, Qichao Sun, Songtao Zhao, Qian He, Xinglong Wu
- Abstract summary: DreamID is a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. We propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity.
- Score: 8.184155602678754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce DreamID, a diffusion-based face swapping model that achieves high levels of ID similarity, attribute preservation, image fidelity, and fast inference speed. Unlike the typical face swapping training process, which often relies on implicit supervision and struggles to achieve satisfactory results, DreamID establishes explicit supervision for face swapping by constructing Triplet ID Group data, significantly enhancing identity similarity and attribute preservation. The iterative nature of diffusion models poses challenges for utilizing efficient image-space loss functions, as performing time-consuming multi-step sampling to obtain the generated image during training is impractical. To address this issue, we leverage the accelerated diffusion model SD Turbo, reducing the inference steps to a single iteration and enabling efficient pixel-level end-to-end training with explicit Triplet ID Group supervision. Additionally, we propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter. This robust architecture fully unlocks the power of the Triplet ID Group explicit supervision. Finally, to further extend our method, we explicitly modify the Triplet ID Group data during training to fine-tune and preserve specific attributes, such as glasses and face shape. Extensive experiments demonstrate that DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity. Overall, DreamID achieves high-quality face swapping results at 512×512 resolution in just 0.6 seconds and performs exceptionally well in challenging scenarios such as complex lighting, large angles, and occlusions.
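To make the training scheme described in the abstract concrete, the following is a minimal PyTorch sketch of one training step with explicit Triplet ID Group supervision under a one-step denoiser standing in for SD Turbo. This is an illustration under stated assumptions, not the authors' code: the names (`triplet_id_group_step`, `one_step_denoiser`, `id_encoder`, `pseudo_gt`) and signatures are hypothetical, and the noising step is simplified.

```python
import torch
import torch.nn.functional as F

def triplet_id_group_step(one_step_denoiser, id_encoder,
                          source_face, target_image, pseudo_gt,
                          noise_scale=1.0):
    """One hypothetical training step with Triplet ID Group supervision.

    (source_face, target_image, pseudo_gt) form a Triplet ID Group:
    pseudo_gt depicts the source identity with the target's attributes
    (pose, expression, lighting). All names here are illustrative.
    """
    # Condition the denoiser on the source identity embedding.
    id_embedding = id_encoder(source_face)

    # Noise the target once and denoise in a single iteration
    # (SD-Turbo-style), so the swapped image is available at train time.
    noisy = target_image + noise_scale * torch.randn_like(target_image)
    swapped = one_step_denoiser(noisy, id_embedding)

    # Explicit pixel-level supervision against the triplet target ...
    pixel_loss = F.l1_loss(swapped, pseudo_gt)

    # ... plus an identity-similarity term in embedding space.
    id_loss = 1.0 - F.cosine_similarity(
        id_encoder(swapped), id_embedding, dim=-1).mean()

    return pixel_loss + id_loss
```

The single-step setup is what makes pixel-level losses practical here: with a multi-step sampler, obtaining `swapped` inside the training loop would require backpropagating through many denoising iterations, which the abstract notes is impractical.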
Related papers
- VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping [43.30061680192465]
We present the first diffusion-based framework specifically designed for video face swapping.
Our approach incorporates a specially designed diffusion model coupled with a VidFaceVAE.
Our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods.
arXiv Detail & Related papers (2024-12-15T18:58:32Z)
- EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation [8.314556078632412]
We introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. For decoupled training, we innovate ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs. We also present ID-enhanced Contrast Alignment (ICA) for further fine-tuning.
arXiv Detail & Related papers (2024-12-02T08:24:11Z)
- HiFiVFS: High Fidelity Video Face Swapping [35.49571526968986]
Face swapping aims to generate results that combine the identity from the source with attributes from the target. We propose a high fidelity video face swapping framework that leverages the strong generative capability and temporal prior of Stable Video Diffusion. Our method achieves state-of-the-art (SOTA) results in video face swapping, both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-11-27T12:30:24Z)
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed ID$^3$.
ID$^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
arXiv Detail & Related papers (2024-09-26T06:46:40Z)
- FPGA: Flexible Portrait Generation Approach [11.002947043723617]
We propose a comprehensive system called FPGA to construct a million-level multi-modal dataset, IDZoom, for training. FPGA consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion based ID Restoration inference framework (DIIR). DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method to enhance its performance.
arXiv Detail & Related papers (2024-08-17T16:34:03Z)
- OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves significantly faster optimization than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z)
- Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present a novel paradigm, Diffusion-ReID, to efficiently augment and generate diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages.
Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation.
ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z)
- Camera-aware Proxies for Unsupervised Person Re-Identification [60.26031011794513]
This paper tackles the purely unsupervised person re-identification (Re-ID) problem that requires no annotations.
We propose to split each single cluster into multiple proxies and each proxy represents the instances coming from the same camera.
Based on the camera-aware proxies, we design both intra- and inter-camera contrastive learning components for our Re-ID model.
arXiv Detail & Related papers (2020-12-19T12:37:04Z)