Related papers: Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis

Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis

URL: http://arxiv.org/abs/2409.19111v2
Date: Wed, 2 Oct 2024 07:56:31 GMT
Title: Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis
Authors: Salaheldin Mohamed, Dong Han, Yong Li,
Abstract summary: Text-to-image (T2I) models have significantly advanced the development of artificial intelligence. However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image. We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
Score: 7.099258248662009
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-image (T2I) models have significantly advanced the development of artificial intelligence, enabling the generation of high-quality images in diverse contexts based on specific text prompts. However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image and to create novel representations of those individuals in various settings. To address this, we leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process. Our approach diverges from prior methods that depend on fixed encoders or static face embeddings, which often fail to bridge encoding gaps. Instead, we capitalize on UNet's sophisticated encoding capabilities to process reference images across multiple scales. By innovatively altering the cross-attention layers of the UNet, we effectively fuse individual identities into the generative process. This strategic integration of facial features across various scales not only enhances the robustness and consistency of the generated images but also facilitates efficient multi-reference and multi-identity generation. Our method sets a new benchmark in identity-preserving image generation, delivering state-of-the-art results in similarity metrics while maintaining prompt alignment.

Related papers

FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization [10.500766709949602]
FaceSnap is a novel method that requires only a single reference image to produce consistent results in a single inference stage.<n>New Facial Attribute Mixer can extract comprehensive fused information from both low-level specific features and high-level abstract features.<n>Landmark Predictor maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions.
arXiv Detail & Related papers (2026-01-31T09:48:48Z)
TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement [87.82338951215131]
TokenAR is a simple but effective token-level enhancement mechanism to address reference identity confusion problem.<n>Instruct Token Injection plays as a role of extra visual feature container to inject detailed and complementary priors for reference tokens.<n>The identity-token disentanglement strategy (ITD) explicitly guides the token representations toward independently representing the features of each identity.
arXiv Detail & Related papers (2025-10-18T03:36:26Z)
A Watermark for Auto-Regressive Image Generation Models [50.599325258178254]
We propose C-reweight, a distortion-free watermarking method explicitly designed for image generation models.<n>C-reweight mitigates retokenization mismatch while preserving image fidelity.
arXiv Detail & Related papers (2025-06-13T00:15:54Z)
Learning Joint ID-Textual Representation for ID-Preserving Image Synthesis [19.869955517856273]
We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy. FaceCLIP learns a joint embedding space for both identity and textual semantics. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline.
arXiv Detail & Related papers (2025-04-19T06:31:07Z)
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. They struggle to support the consistent generation of identity-preserving requirements for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising. We present textbfID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. We introduce an identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information. We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task, that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately. We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
InstantID: Zero-shot Identity-Preserving Generation in Seconds [21.04236321562671]
We introduce InstantID, a powerful diffusion model-based solution for ID embedding. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image. Our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL.
arXiv Detail & Related papers (2024-01-15T07:50:18Z)
PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization [92.90392834835751]
PortraitBooth is designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation. PortraitBooth eliminates computational overhead and mitigates identity distortion. It incorporates emotion-aware cross-attention control for diverse facial expressions in generated images.
arXiv Detail & Related papers (2023-12-11T13:03:29Z)
When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images. We present a novel use of the extended StyleGAN embedding space $mathcalW_+$ to achieve enhanced identity preservation and disentanglement for diffusion models. Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
arXiv Detail & Related papers (2023-11-29T09:05:14Z)
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models [71.15152184631951]
We propose a fully automated solution for consistent character generation with the sole input being a text prompt. Our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods.
arXiv Detail & Related papers (2023-11-16T18:59:51Z)
PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models [19.519789922033034]
PhotoVerse is an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains. After a single training phase, our approach enables generating high-quality images within only a few seconds.
arXiv Detail & Related papers (2023-09-11T19:59:43Z)
T-Person-GAN: Text-to-Person Image Generation with Identity-Consistency and Manifold Mix-Up [16.165889084870116]
We present an end-to-end approach to generate high-resolution person images conditioned on texts only. We develop an effective generative model to produce person images with two novel mechanisms.
arXiv Detail & Related papers (2022-08-18T07:41:02Z)
DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators. We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image. We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image. Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.