Related papers: Towards Controllable Face Generation with Semantic Latent Diffusion Models

URL: http://arxiv.org/abs/2403.12743v1
Date: Tue, 19 Mar 2024 14:02:13 GMT
Title: Towards Controllable Face Generation with Semantic Latent Diffusion Models
Authors: Alex Ergasti, Claudio Ferrari, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati
Abstract summary: We propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing. The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for a precise control over each of the semantic parts of the human face.
Score: 6.438244172631555
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semantic Image Synthesis (SIS) is among the most popular and effective techniques in the field of face generation and editing, thanks to its good generation quality and the versatility it brings along. Recent works attempted to go beyond the standard GAN-based framework, and started to explore Diffusion Models (DMs) for this task as these stand out with respect to GANs in terms of both quality and diversity. On the other hand, DMs lack fine-grained controllability and reproducibility. To address that, in this paper we propose a SIS framework based on a novel Latent Diffusion Model architecture for human face generation and editing that is able both to reproduce and manipulate a real reference image and to generate diversity-driven results. The proposed system utilizes both SPADE normalization and cross-attention layers to merge shape and style information and, by doing so, allows for precise control over each of the semantic parts of the human face. This was not possible with previous methods in the state of the art. Finally, we performed an extensive set of experiments to prove that our model surpasses the current state of the art, both qualitatively and quantitatively.

Related papers

Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation [55.2480439325792]
Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively.
arXiv Detail & Related papers (2025-07-30T19:43:47Z)
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance. Experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation [10.402456492958457]
This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. We introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. We introduce a comprehensive system designed for creating and editing high-quality face assets.
arXiv Detail & Related papers (2025-04-21T17:38:50Z)
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis [59.113899155476005]
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation. We propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information.
arXiv Detail & Related papers (2025-03-19T20:50:10Z)
JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling [62.77347895550087]
We introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures. To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline.
arXiv Detail & Related papers (2024-12-29T14:18:35Z)
SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models [29.430749386234414]
We propose a novel Self-supervised Hierarchical Makeup Transfer (SHMT) method via latent diffusion models. SHMT works in a self-supervised manner, freeing itself from the misguidance of pseudo-paired data. To accommodate a variety of makeup styles, hierarchical texture details are decomposed via a Laplacian pyramid.
arXiv Detail & Related papers (2024-12-15T05:29:07Z)
StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer [11.83733187403255]
StyleDiT is a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces. We introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions. We extend the application to an unexplored domain: predicting a partner's facial images using a child's image and one parent's image.
arXiv Detail & Related papers (2024-12-14T10:47:17Z)
Stable Flow: Vital Layers for Training-Free Image Editing [74.52248787189302]
Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT). We propose an automatic method to identify "vital layers" within DiT, crucial for image formation. Next, to enable real-image editing, we introduce an improved image inversion method for flow models.
arXiv Detail & Related papers (2024-11-21T18:59:51Z)
LDFaceNet: Latent Diffusion-based Network for High-Fidelity Deepfake Generation [6.866014367868788]
This paper proposes a novel facial swapping module, termed LDFaceNet (Latent Diffusion based Face Swapping Network). It is based on a guided latent diffusion model that utilizes facial segmentation and facial recognition modules for a conditioned denoising process. The results of this study demonstrate that the proposed method can generate extremely realistic and coherent images.
arXiv Detail & Related papers (2024-08-04T16:09:04Z)
JoReS-Diff: Joint Retinex and Semantic Priors in Diffusion Model for Low-light Image Enhancement [69.6035373784027]
Low-light image enhancement (LLIE) has achieved promising performance by employing conditional diffusion models. Previous methods may neglect the importance of a sufficient formulation of task-specific condition strategy. We propose JoReS-Diff, a novel approach that incorporates Retinex- and semantic-based priors as the additional pre-processing condition.
arXiv Detail & Related papers (2023-12-20T08:05:57Z)
Multi-View Unsupervised Image Generation with Cross Attention Guidance [23.07929124170851]
This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. We identify object poses by clustering the dataset through comparing visibility and locations of specific object parts. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images.
arXiv Detail & Related papers (2023-12-07T14:55:13Z)
Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models [13.019535928387702]
This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.
arXiv Detail & Related papers (2023-10-10T05:13:17Z)
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator [29.58245990622227]
Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio. Existing methods ignore the potential of text modal, and their generators mainly follow the source-oriented feature paradigm coupled with unstable GAN frameworks. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes.
arXiv Detail & Related papers (2023-05-04T07:01:36Z)
Collaborative Diffusion for Multi-Modal Face Generation and Editing [34.16906110777047]
We present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model.
arXiv Detail & Related papers (2023-04-20T17:59:02Z)
Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization [113.03189252044773]
We propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle domain shift in various visual tasks. Our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-12-18T11:42:51Z)
Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.