SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
- URL: http://arxiv.org/abs/2303.11305v4
- Date: Sun, 2 Jul 2023 21:16:39 GMT
- Title: SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
- Authors: Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas,
Feng Yang
- Abstract summary: We propose a novel approach to address limitations in existing text-to-image diffusion models for personalization.
Our method involves fine-tuning the singular values of the weight matrices, leading to a compact and efficient parameter space.
We also propose a Cut-Mix-Unmix data-augmentation technique to enhance the quality of multi-subject image generation and a simple text-based image editing framework.
- Score: 19.978410014103435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have achieved remarkable success in text-to-image
generation, enabling the creation of high-quality images from text prompts or
other modalities. However, existing methods for customizing these models are
limited in their ability to handle multiple personalized subjects and are prone
to overfitting. Moreover, their large number of parameters makes model storage
inefficient. In
this paper, we propose a novel approach to address these limitations in
existing text-to-image diffusion models for personalization. Our method
involves fine-tuning the singular values of the weight matrices, leading to a
compact and efficient parameter space that reduces the risk of overfitting and
language drift. We also propose a Cut-Mix-Unmix data-augmentation technique
to enhance the quality of multi-subject image generation and a simple
text-based image editing framework. Our proposed SVDiff method has a
significantly smaller model size compared to existing methods (approximately
2,200 times fewer parameters compared with vanilla DreamBooth), making it more
practical for real-world applications.
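To make the core idea concrete, the sketch below parameterizes a single frozen linear layer by a trainable shift over its singular values, in the spirit of the abstract's description of fine-tuning only the singular values of the weight matrices. This is a minimal illustration rather than the authors' implementation: the class name SpectralShiftLinear, the ReLU on the shifted singular values, the 768-dimensional layer, and the optimizer settings are assumptions made for this example.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class SpectralShiftLinear(nn.Module):
    """Sketch of singular-value fine-tuning: only a per-singular-value shift
    ("spectral shift") of a frozen pretrained weight matrix is trained."""

    def __init__(self, weight: torch.Tensor, bias: Optional[torch.Tensor] = None):
        super().__init__()
        # One-time SVD of the frozen pretrained weight (out_features x in_features).
        U, sigma, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("sigma", sigma)
        self.register_buffer("Vh", Vh)
        self.register_buffer("frozen_bias", bias)
        # The only trainable parameters: one shift per singular value.
        self.delta = nn.Parameter(torch.zeros_like(sigma))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reassemble the weight with shifted singular values; the ReLU keeping
        # them non-negative is an assumption about the exact parameterization.
        weight = self.U @ torch.diag(torch.relu(self.sigma + self.delta)) @ self.Vh
        return F.linear(x, weight, self.frozen_bias)


# Hypothetical usage: wrap one pretrained layer and optimize only its spectral shift.
pretrained = nn.Linear(768, 768)
svdiff_layer = SpectralShiftLinear(pretrained.weight.detach().clone(),
                                   pretrained.bias.detach().clone())
optimizer = torch.optim.AdamW([svdiff_layer.delta], lr=1e-3)
```

Applied to each weight matrix of the diffusion model in this way, the trainable state per layer is a single vector of spectral shifts rather than a full weight update, which is what allows the stored model to be orders of magnitude smaller than a fully fine-tuned checkpoint.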
Related papers
- PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction [38.424899483761656]
PaRa is an effective and efficient Rank Reduction approach for T2I model personalization.
Our design is motivated by the fact that taming a T2I model toward a novel concept implies a small generation space.
We show that PaRa achieves great advantages over existing finetuning approaches on single/multi-subject generation as well as single-image editing.
arXiv Detail & Related papers (2024-06-09T04:51:51Z)
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
CLS embeddings are used both to augment the text embeddings and, together with patch embeddings, to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
- DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models [46.58122934173729]
DiffuseKronA is a Kronecker product-based adaptation module for subject-driven text-to-image (T2I) generative models.
It significantly reduces the parameter count by 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively.
It can achieve up to a 50% reduction with results comparable to LoRA-DreamBooth.
arXiv Detail & Related papers (2024-02-27T11:05:34Z)
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- Matryoshka Diffusion Models [38.26966802461602]
Diffusion models are the de facto approach for generating high-quality images and videos.
We introduce Matryoshka Diffusion Models, an end-to-end framework for high-resolution image and video synthesis.
We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications.
arXiv Detail & Related papers (2023-10-23T17:20:01Z)
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images [56.17404812357676]
Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes.
We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size.
We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
arXiv Detail & Related papers (2023-08-31T09:27:56Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA [64.10981296843609]
We show that recent state-of-the-art customization methods for text-to-image models suffer from catastrophic forgetting when new concepts arrive sequentially.
We propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in cross attention layers of the popular Stable Diffusion model.
We show that C-LoRA not only outperforms several baselines in our proposed setting of text-to-image continual customization, but also achieves a new state of the art in the well-established rehearsal-free continual learning setting for image classification.
arXiv Detail & Related papers (2023-04-12T17:59:41Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can benefit a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)