EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
- URL: http://arxiv.org/abs/2412.01254v2
- Date: Thu, 13 Mar 2025 08:32:46 GMT
- Title: EmojiDiff: Advanced Facial Expression Control with High Identity Preservation in Portrait Generation
- Authors: Liangwei Jiang, Ruida Li, Zhifeng Zhang, Shuo Fang, Chenguang Ma
- Abstract summary: We introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. For decoupled training, we develop ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs. We also present ID-enhanced Contrast Alignment (ICA) for further fine-tuning.
- Score: 8.314556078632412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper aims to bring fine-grained expression control to portrait generation while maintaining high-fidelity identity. This is challenging due to the mutual interference between expression and identity: (i) fine expression control signals inevitably introduce appearance-related semantics (e.g., facial contours and proportions) that affect the identity of the generated portrait; (ii) even coarse-grained expression control can cause facial changes that compromise identity, since both signals act on the face. These limitations remain unaddressed by previous generation methods, which rely primarily on coarse control signals or on two-stage inference that integrates portrait animation. Here, we introduce EmojiDiff, the first end-to-end solution that enables simultaneous control of extremely detailed expression (RGB-level) and high-fidelity identity in portrait generation. To address the above challenges, EmojiDiff adopts a two-stage scheme of decoupled training followed by fine-tuning. For decoupled training, we develop ID-irrelevant Data Iteration (IDI) to synthesize cross-identity expression pairs by dividing and optimizing the processes of maintaining expression and altering identity, thereby ensuring stable and high-quality data generation. Training the model on this data, we effectively disentangle fine expression features in the expression template from extraneous information (e.g., identity, skin). Subsequently, we present ID-enhanced Contrast Alignment (ICA) for further fine-tuning. ICA achieves rapid reconstruction and joint supervision of identity and expression information, thereby aligning the identity representations of images generated with and without expression control. Experimental results demonstrate that our method markedly outperforms competing approaches, achieves precise expression control with well-preserved identity, and generalizes well to various diffusion models.
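As a rough illustration of the two-stage scheme, the sketch below shows one way the IDI data synthesis and the ICA objective could be organized. All helper names (keep_expression, alter_identity, id_encoder, expression_head) are hypothetical stand-ins, not the paper's published API.
```python
import torch.nn.functional as F

def idi_pairs(expression_images, identity_bank, keep_expression, alter_identity):
    """ID-irrelevant Data Iteration (IDI), conceptually: hold the expression
    fixed while swapping in other identities, so each pair shares only the
    expression signal. keep_expression/alter_identity are hypothetical helpers."""
    pairs = []
    for img in expression_images:
        expr_template = keep_expression(img)   # expression-preserving step
        for new_id in identity_bank:
            pairs.append((alter_identity(img, new_id), expr_template))
    return pairs

def ica_loss(id_encoder, expression_head, img_with_ctrl, img_without_ctrl, expr_target):
    """ID-enhanced Contrast Alignment (ICA), conceptually: align identity
    embeddings of generations with and without expression control, while
    jointly supervising the expression branch."""
    id_align = 1.0 - F.cosine_similarity(
        id_encoder(img_with_ctrl), id_encoder(img_without_ctrl), dim=-1
    ).mean()
    expr_term = F.mse_loss(expression_head(img_with_ctrl), expr_target)
    return id_align + expr_term
```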
Related papers
- DreamID: High-Fidelity and Fast diffusion-based Face Swapping via Triplet ID Group Learning [8.184155602678754]
DreamID is a diffusion-based face swapping model that achieves high ID similarity, attribute preservation, and image fidelity, with fast inference.
We propose an improved diffusion-based model architecture comprising SwapNet, FaceNet, and ID Adapter.
DreamID outperforms state-of-the-art methods in terms of identity similarity, pose and expression preservation, and image fidelity.
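The title's "Triplet ID Group Learning" suggests a triplet-style identity objective; the sketch below is a generic guess at such a loss, not DreamID's actual formulation.
```python
import torch.nn.functional as F

def triplet_id_loss(anchor, positive, negative, margin: float = 0.3):
    """Generic triplet identity loss: pull the swap result (anchor) toward the
    source identity embedding (positive) and away from the target face's
    original identity embedding (negative)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```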
arXiv Detail & Related papers (2025-04-20T06:53:00Z)
- Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter [10.608872317957026]
"lip averaging" phenomenon occurs when a model fails to preserve subtle facial details when dubbing unseen in-the-wild videos.
We propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences.
arXiv Detail & Related papers (2025-03-09T02:36:31Z)
- ID$^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition [60.15830516741776]
Synthetic face recognition (SFR) aims to generate datasets that mimic the distribution of real face data.
We introduce a diffusion-fueled SFR model termed ID$^3$.
ID$^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances.
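An ID-preserving loss of the kind described is commonly built on a frozen face recognition backbone; the sketch below is a hedged illustration, not the paper's exact loss.
```python
import torch
import torch.nn.functional as F

def id_preserving_loss(face_encoder, generated, reference):
    """Cosine distance between identity embeddings of generated and reference
    faces; face_encoder is assumed to be a frozen recognition backbone."""
    with torch.no_grad():
        ref = F.normalize(face_encoder(reference), dim=-1)
    gen = F.normalize(face_encoder(generated), dim=-1)
    return 1.0 - (gen * ref).sum(dim=-1).mean()
```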
arXiv Detail & Related papers (2024-09-26T06:46:40Z)
- Towards Localized Fine-Grained Control for Facial Expression Generation [54.82883891478555]
Humans, particularly their faces, are central to content generation due to their ability to convey rich expressions and intent.
Current generative models mostly produce flat, neutral expressions or characterless smiles that lack authenticity.
We propose the use of AUs (action units) for facial expression control in face generation.
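For readers unfamiliar with AU conditioning, the sketch below shows one generic way to inject an action-unit intensity vector into a generator's conditioning space; the projection layer and dimensions are assumptions, not the paper's design.
```python
import torch
import torch.nn as nn

class AUConditioner(nn.Module):
    """Projects an AU intensity vector (e.g., 17 AUs scaled to [0, 1]) into a
    generator's conditioning space. Layer sizes are illustrative."""
    def __init__(self, num_aus: int = 17, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(num_aus, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, au_intensities: torch.Tensor) -> torch.Tensor:
        # (batch, num_aus) -> (batch, cond_dim) token to inject as conditioning
        return self.proj(au_intensities)
```
In such a setup, raising AU12 (lip corner puller) should push the output toward a smile, while AU4 (brow lowerer) pushes toward a frown, giving localized control.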
arXiv Detail & Related papers (2024-07-25T18:29:48Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
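The AdaIN-mean operation builds on classic AdaIN (Huang & Belongie, 2017); the sketch below shows standard AdaIN for reference, since the exact "-mean" variant is specific to the paper and not reproduced here.
```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Classic AdaIN: re-normalize content features (batch, C, H, W) to match
    the channel-wise mean and std of the style features."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```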
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
- Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation.
We propose a face-wise attention loss that focuses ID features on the face region rather than entangling them with ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
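A face-wise attention loss of the kind described can be illustrated generically: penalize the ID token's cross-attention mass that falls outside a face mask. The sketch below is an assumption-laden illustration, not the paper's exact formulation.
```python
import torch

def face_attention_loss(attn_map: torch.Tensor, face_mask: torch.Tensor) -> torch.Tensor:
    """attn_map: (batch, H, W) cross-attention weights of the ID token;
    face_mask: (batch, H, W) binary face-region mask. Penalizes attention
    mass that leaks outside the face."""
    attn = attn_map / (attn_map.sum(dim=(1, 2), keepdim=True) + 1e-8)
    outside = (attn * (1.0 - face_mask)).sum(dim=(1, 2))
    return outside.mean()
```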
arXiv Detail & Related papers (2024-01-31T11:52:33Z)
- Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation [34.72612800373437]
In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images users actually want.
We propose a novel multi-modal face generation framework, capable of simultaneous identity-expression control and more fine-grained expression synthesis.
arXiv Detail & Related papers (2024-01-02T13:28:39Z)
- PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization [92.90392834835751]
PortraitBooth is designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation.
PortraitBooth eliminates computational overhead and mitigates identity distortion.
It incorporates emotion-aware cross-attention control for diverse facial expressions in generated images.
arXiv Detail & Related papers (2023-12-11T13:03:29Z)
- When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
arXiv Detail & Related papers (2023-11-29T09:05:14Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as annotations.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Disentangling Identity and Pose for Facial Expression Recognition [54.50747989860957]
We propose an identity and pose disentangled facial expression recognition (IPD-FER) model to learn more discriminative feature representation.
For the identity encoder, a well-pre-trained face recognition model is used and kept fixed during training, which eases the dependence on expression-specific training data.
By comparing the difference between synthesized neutral and expressional images of the same individual, the expression component is further disentangled from identity and pose.
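Conceptually, this neutral-vs-expressional comparison can be read as a feature subtraction; the sketch below is a simplified illustration with placeholder names, not the IPD-FER architecture.
```python
import torch

def expression_component(feat_encoder, expressional_img: torch.Tensor,
                         synthesized_neutral_img: torch.Tensor) -> torch.Tensor:
    """With identity and pose shared by both inputs, subtracting features of
    the synthesized neutral face isolates an expression-only component."""
    return feat_encoder(expressional_img) - feat_encoder(synthesized_neutral_img)
```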
arXiv Detail & Related papers (2022-08-17T06:48:13Z)
- Mutual Information Regularized Identity-aware Facial Expression Recognition in Compressed Video [27.602648102881535]
We propose a novel collaborative min-min game for mutual information (MI) minimization in latent space.
We do not need the identity label or multiple expression samples from the same person for identity elimination.
Our solution achieves performance comparable to or better than that of recent methods operating on decoded images.
arXiv Detail & Related papers (2020-10-20T21:42:18Z)
- LEED: Label-Free Expression Editing via Disentanglement [57.09545215087179]
The LEED framework edits the expression of both frontal and profile facial images without requiring any expression label.
Two novel losses are designed for optimal expression disentanglement and consistent synthesis.
arXiv Detail & Related papers (2020-07-17T13:36:15Z)
- Fine-Grained Expression Manipulation via Structured Latent Space [30.789513209376032]
We propose an end-to-end expression-guided generative adversarial network (EGGAN) to manipulate fine-grained expressions.
Our method can manipulate fine-grained expressions, and generate continuous intermediate expressions between source and target expressions.
arXiv Detail & Related papers (2020-04-21T06:18:34Z)
- Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim to transform an image of a fine-grained category into new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)