Controllable Face Synthesis with Semantic Latent Diffusion Models
- URL: http://arxiv.org/abs/2403.12743v2
- Date: Tue, 30 Jul 2024 07:54:01 GMT
- Title: Controllable Face Synthesis with Semantic Latent Diffusion Models
- Authors: Alex Ergasti, Claudio Ferrari, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati
- Abstract summary: We propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing.
The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for a precise control over each of the semantic parts of the human face.
- Score: 6.438244172631555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic Image Synthesis (SIS) is among the most popular and effective techniques in the field of face generation and editing, thanks to its good generation quality and the versatility it brings along. Recent works attempted to go beyond the standard GAN-based framework and started to explore Diffusion Models (DMs) for this task, as these stand out with respect to GANs in terms of both quality and diversity. On the other hand, DMs lack fine-grained controllability and reproducibility. To address that, in this paper we propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing that is able both to reproduce and manipulate a real reference image and to generate diversity-driven results. The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for precise control over each of the semantic parts of the human face, which was not possible with previous state-of-the-art methods. Finally, we performed an extensive set of experiments to prove that our model surpasses the current state of the art, both qualitatively and quantitatively.
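The abstract names the two conditioning mechanisms (SPADE normalization carrying the semantic layout, cross-attention carrying per-part style) but not their wiring. The PyTorch snippet below is a minimal, hypothetical sketch of how such a block could look inside a latent diffusion denoiser, assuming one style token per semantic part; it illustrates the general pattern, not the authors' implementation.

```python
# A minimal, hypothetical sketch of a denoising block mixing SPADE (shape) and
# cross-attention (style) conditioning; layer sizes and wiring are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive normalization: modulation maps come from the semantic mask."""
    def __init__(self, channels: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(num_classes, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, mask):
        # mask: one-hot semantic layout, resized to the feature resolution
        mask = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
        h = self.shared(mask)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

class ShapeStyleBlock(nn.Module):
    """SPADE injects where each part goes; cross-attention injects how it looks."""
    def __init__(self, channels: int, num_classes: int, style_dim: int):
        super().__init__()
        self.spade = SPADE(channels, num_classes)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # channels is assumed divisible by the number of attention heads
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.to_kv = nn.Linear(style_dim, channels)

    def forward(self, x, mask, style_tokens):
        # style_tokens: (B, num_parts, style_dim), one token per semantic part
        h = self.conv(F.silu(self.spade(x, mask)))
        b, c, height, width = h.shape
        q = h.flatten(2).transpose(1, 2)          # (B, H*W, C) queries from features
        kv = self.to_kv(style_tokens)             # (B, num_parts, C) keys/values from styles
        attn_out, _ = self.attn(q, kv, kv)
        return x + attn_out.transpose(1, 2).reshape(b, c, height, width)
```

In a layout like this, the mask decides where each semantic part appears while its style token decides how it looks, which is the kind of per-part control the abstract claims.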
Related papers
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance.
We show that baseline DiT and SiT models trained on ReaLS achieve a 15% improvement in the FID metric.
The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z) - JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling [62.77347895550087]
We introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control.
Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures.
To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline.
arXiv Detail & Related papers (2024-12-29T14:18:35Z) - SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models [29.430749386234414]
We propose a novel Self-supervised Hierarchical Makeup Transfer (SHMT) method via latent diffusion models.
SHMT works in a self-supervised manner, freeing itself from the misguidance of pseudo-paired data.
To accommodate a variety of makeup styles, hierarchical texture details are decomposed via a Laplacian pyramid.
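The SHMT summary above relies on a Laplacian pyramid to separate texture detail from coarse structure; as a reference point, a minimal sketch of that classical decomposition (not SHMT's implementation) looks as follows.

```python
# Classical Laplacian pyramid decomposition (illustrative; not SHMT's code).
import torch
import torch.nn.functional as F

def laplacian_pyramid(img: torch.Tensor, levels: int = 3):
    """img: (B, C, H, W) in [0, 1]. Returns [detail_0, ..., detail_{levels-1}, coarse]."""
    pyramid = []
    current = img
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)                  # low-pass + downsample
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear",
                           align_corners=False)                      # back to original size
        pyramid.append(current - up)                                 # band-pass detail layer
        current = down
    pyramid.append(current)                                          # coarsest residual
    return pyramid

def reconstruct(pyramid):
    """Exact inverse of the decomposition above."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=detail.shape[-2:], mode="bilinear",
                                align_corners=False) + detail
    return current
```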
arXiv Detail & Related papers (2024-12-15T05:29:07Z) - StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer [11.83733187403255]
StyleDiT is a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces.
We introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions.
We extend the application to an unexplored domain: predicting a partner's facial images using a child's image and one parent's image.
arXiv Detail & Related papers (2024-12-14T10:47:17Z) - LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation [6.866014367868788]
This paper proposes a novel face swapping module, termed LDFaceNet (Latent Diffusion-based Face Swapping Network).
It is based on a guided latent diffusion model that utilizes facial segmentation and facial recognition modules for a conditioned denoising process.
The results of this study demonstrate that the proposed method can generate extremely realistic and coherent images.
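A common way to condition a denoising process on segmentation and recognition modules, as LDFaceNet's summary describes, is external guidance: gradients of auxiliary losses on the predicted clean image steer each reverse step. The sketch below illustrates that generic pattern; eps_model, seg_model, id_model, and the loss weights are assumed placeholders, not LDFaceNet's components.

```python
# Hypothetical external-guidance step: steer one reverse diffusion step with gradients
# from segmentation and identity objectives (placeholder modules, not LDFaceNet's code).
import torch
import torch.nn.functional as F

def guided_step(x_t, t, eps_model, seg_model, id_model, target_mask, target_id,
                alpha_bar_t, seg_weight=1.0, id_weight=1.0):
    """x_t: noisy latent/image at step t; alpha_bar_t: cumulative alpha product (tensor)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                                        # predicted noise
    # DDPM relation: estimate the clean image x0 from the noise prediction.
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    # Auxiliary objectives: segmentation agreement and identity similarity.
    seg_loss = F.cross_entropy(seg_model(x0_hat), target_mask)
    id_loss = 1 - F.cosine_similarity(id_model(x0_hat), target_id).mean()
    grad = torch.autograd.grad(seg_weight * seg_loss + id_weight * id_loss, x_t)[0]
    # Classifier-guidance-style shift of the noise prediction along the loss gradient.
    return eps + (1 - alpha_bar_t).sqrt() * grad
```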
arXiv Detail & Related papers (2024-08-04T16:09:04Z) - JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models.
Previous methods often neglect the importance of a well-formulated task-specific conditioning strategy.
We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
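The Retinex prior mentioned for JoReS-Diff assumes an image factors into reflectance times illumination, so a crude single-scale estimate can be obtained by blurring heavily and dividing the illumination out. The snippet below is purely an assumed illustration of that classical decomposition, not the paper's conditioning pipeline.

```python
# Crude single-scale Retinex decomposition (illustrative assumption, not JoReS-Diff's code).
import torch
import torch.nn.functional as F

def retinex_prior(img: torch.Tensor, blur_kernel: int = 31, eps: float = 1e-4):
    """img: (B, 3, H, W) in [0, 1]. Returns (illumination, reflectance) maps."""
    # Estimate illumination as a heavily smoothed luminance map.
    lum = img.mean(dim=1, keepdim=True)
    pad = blur_kernel // 2
    illumination = F.avg_pool2d(F.pad(lum, (pad, pad, pad, pad), mode="reflect"),
                                blur_kernel, stride=1)
    # Reflectance is what remains after dividing the illumination out.
    reflectance = img / (illumination + eps)
    return illumination, reflectance.clamp(0, 2)
```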
arXiv Detail & Related papers (2023-12-20T08:05:57Z) - Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models [13.019535928387702]
This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages.
Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.
arXiv Detail & Related papers (2023-10-10T05:13:17Z) - Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation.
We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
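Steering an unconditionally trained diffusion model toward such tasks typically reuses the same external-guidance pattern sketched above for LDFaceNet, only with a task-specific constraint; a hypothetical inpainting constraint, for instance, could look like this.

```python
# Hypothetical zero-shot inpainting constraint for steering an unconditional model
# (an assumed illustration, not the Steered Diffusion implementation).
import torch

def inpainting_loss(x0_hat: torch.Tensor, reference: torch.Tensor,
                    known_mask: torch.Tensor) -> torch.Tensor:
    """Penalize deviation from the reference image on the pixels that must be kept."""
    return ((known_mask * (x0_hat - reference)) ** 2).mean()

# Its gradient with respect to x_t would adjust the noise prediction at every reverse
# step, as in the guided_step sketch above; swapping the loss (e.g. a grayscale
# consistency term) would yield colorization or other zero-shot conditional tasks.
```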
arXiv Detail & Related papers (2023-09-30T02:03:22Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator [29.58245990622227]
Multimodal-driven talking face generation refers to animating a portrait with a given pose, expression, and gaze transferred from a driving image or video, or estimated from text and audio.
Existing methods ignore the potential of the text modality, and their generators mainly follow the source-oriented feature paradigm coupled with unstable GAN frameworks.
We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes.
arXiv Detail & Related papers (2023-05-04T07:01:36Z) - Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks.
Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)