FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement
- URL: http://arxiv.org/abs/2504.03292v1
- Date: Fri, 04 Apr 2025 09:17:57 GMT
- Title: FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement
- Authors: Gia-Nghia Tran, Quang-Huy Che, Trong-Tai Dam Vu, Bich-Nga Pham, Vinh-Tiep Nguyen, Trung-Nghia Le, Minh-Triet Tran
- Abstract summary: Current methods often overfit when trained on a small number of samples. FaR is a novel approach that tackles these challenges through two key contributions: a Concept Fusion technique and a Localized Refinement loss function. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
- Score: 9.25127189426397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: a Concept Fusion technique and a Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, the Localized Refinement loss function is introduced to preserve each subject's representative attributes by aligning each concept's attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules simultaneously, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
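To make the two components concrete, the sketch below renders the general ideas in PyTorch: a cut-and-paste composite in the spirit of Concept Fusion, and an attention-alignment penalty in the spirit of the Localized Refinement loss. This is a minimal sketch under assumed interfaces (per-subject segmentation masks and extracted cross-attention maps); the function names, tensor shapes, and normalization choices are illustrative rather than the authors' released implementation.

```python
import torch


def concept_fusion_composite(subjects, masks, background):
    """Paste segmented reference subjects onto a shared background.

    subjects:   list of (3, H, W) subject crops
    masks:      list of (1, H, W) binary masks (1 = subject pixel)
    background: (3, H, W) background image
    Mimics the idea of recombining subjects and backgrounds into composite
    images to diversify a small training set (hypothetical helper).
    """
    composite = background.clone()
    for subject, mask in zip(subjects, masks):
        composite = mask * subject + (1 - mask) * composite
    return composite


def localized_refinement_loss(attn_maps, region_masks):
    """Penalize cross-attention mass that each concept places outside its
    own region, so that similar subjects do not mix their attention maps.

    attn_maps:    (num_concepts, H, W) cross-attention map per concept token
    region_masks: (num_concepts, H, W) binary mask of each concept's region
    """
    loss = 0.0
    for attn, mask in zip(attn_maps, region_masks):
        attn = attn / (attn.sum() + 1e-8)       # normalize to a spatial distribution
        outside = (attn * (1 - mask)).sum()     # attention mass leaking outside the region
        loss = loss + outside
    return loss / len(attn_maps)
```

In practice, a penalty of this kind would presumably be added to the standard denoising objective with a weighting hyperparameter while fine-tuning the selected modules.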
Related papers
- Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization [20.783312940122297]
Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts.
However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary.
We propose Dynamic Mask coupled with Concept-Aware Loss, a novel unlearning framework designed for multi-concept forgetting.
arXiv Detail & Related papers (2025-04-12T01:38:58Z)
- Improving Multi-Subject Consistency in Open-Domain Image Generation with Isolation and Reposition Attention [25.69140705877128]
We propose a training-free diffusion model with Isolation and Reposition Attention, named IR-Diffusion. Isolation Attention ensures that multiple subjects in the target image do not reference each other, effectively eliminating subject convergence. Reposition Attention involves scaling and repositioning subjects in both reference and target images to the same position within the images.
arXiv Detail & Related papers (2024-11-28T16:50:30Z)
- Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion [50.26583654615212]
Lifelong few-shot customization for text-to-image diffusion aims to continually generalize existing models for new tasks with minimal data.
In this study, we identify the catastrophic forgetting problems and categorize them into two types: relevant concepts forgetting and previous concepts forgetting.
Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones.
arXiv Detail & Related papers (2024-11-08T12:58:48Z)
- Training-free Composite Scene Generation for Layout-to-Image Synthesis [29.186425845897947]
This paper introduces a novel training-free approach designed to overcome adversarial semantic intersections during the diffusion conditioning phase.
We propose two innovative constraints: 1) an inter-token constraint that resolves token conflicts to ensure accurate concept synthesis; and 2) a self-attention constraint that improves pixel-to-pixel relationships.
Our evaluations confirm the effectiveness of leveraging layout information for guiding the diffusion process, generating content-rich images with enhanced fidelity and complexity.
arXiv Detail & Related papers (2024-07-18T15:48:07Z)
- Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration [64.84134880709625]
We show that it is possible to perform domain adaptation via the noise space using diffusion models. In particular, by leveraging the unique property of how auxiliary conditional inputs influence the multi-step denoising process, we derive a meaningful diffusion loss. We present crucial strategies such as a channel-shuffling layer and residual-swapping contrastive learning in the diffusion model.
arXiv Detail & Related papers (2024-06-26T17:40:30Z)
- Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
- Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention [62.671435607043875]
Research indicates that text-to-image diffusion models replicate images from their training data, raising tremendous concerns about potential copyright infringement and privacy risks.
We reveal that during memorization, the cross-attention tends to focus disproportionately on the embeddings of specific tokens.
We introduce an innovative approach to detect and mitigate memorization in diffusion models.
arXiv Detail & Related papers (2024-03-17T01:27:00Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
- A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis [24.159726798004748]
We present two test-time attention-based loss functions for text-to-image generative models.
First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt.
Second, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps.
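As a rough illustration of how such test-time attention objectives can be expressed, the hypothetical sketch below measures pairwise overlap between concept attention maps and drift from earlier-step reference maps; the exact formulations are assumptions, not A-STAR's actual implementation.

```python
import torch


def attention_segregation_loss(attn_maps):
    """Penalize spatial overlap between the cross-attention maps of different
    concepts (an assumed form of a segregation objective).

    attn_maps: (num_concepts, H, W), each map roughly summing to 1.
    """
    loss = 0.0
    n = attn_maps.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            # Overlap taken as the summed element-wise minimum of two maps.
            loss = loss + torch.minimum(attn_maps[i], attn_maps[j]).sum()
    return loss


def attention_retention_loss(attn_maps, reference_maps):
    """Keep each concept's attention at the current denoising step close to a
    reference map from an earlier step, so no concept fades out over time
    (again, an assumed formulation).
    """
    return torch.mean((attn_maps - reference_maps) ** 2)
```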
arXiv Detail & Related papers (2023-06-26T09:34:10Z)
- CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion [68.78897015832113]
We propose a coupled contrastive learning network, dubbed CoCoNet, to realize infrared and visible image fusion. Our method achieves state-of-the-art (SOTA) performance under both subjective and objective evaluation.
arXiv Detail & Related papers (2022-11-20T12:02:07Z)