Related papers: Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2410.00700v2
Date: Wed, 2 Oct 2024 06:13:56 GMT
Title: Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
Authors: Saurav Jha, Shiqi Yang, Masato Ishii, Mengjie Zhao, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi, Yuki Mitsufuji
Abstract summary: In the real world, a user may wish to personalize a model on multiple concepts but one at a time. Most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones. We propose regularizing the parameter-space and function-space of text-to-image diffusion models.
Score: 39.46152582128077
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones -- a challenge that continual personalization (CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models, to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art.

Related papers

Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift [5.608240462042483]
Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. We propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution.
arXiv Detail & Related papers (2025-05-26T05:03:59Z)
Hollowed Net for On-Device Personalization of Text-to-Image Diffusion Models [51.3915762595891]
This paper presents an efficient LoRA-based personalization approach for on-device subject-driven generation. Our method, termed Hollowed Net, enhances memory efficiency during fine-tuning by modifying the architecture of a diffusion U-Net.
arXiv Detail & Related papers (2024-11-02T08:42:48Z)
Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images. Our method incorporates a novel cross-attention guidance to decompose multiple concepts. Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z)
Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters [67.28751868277611]
Recent work has demonstrated the ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential manner. We show that the capacity to learn new tasks reaches saturation over longer sequences. We introduce a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized tokens.
arXiv Detail & Related papers (2023-11-30T18:04:21Z)
CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization [56.892032386104006]
CatVersion is an inversion-based method that learns the personalized concept through a handful of examples. Users can utilize text prompts to generate images that embody the personalized concept.
arXiv Detail & Related papers (2023-11-24T17:55:10Z)
Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts. Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA [64.10981296843609]
We show that recent state-of-the-art customization of text-to-image models suffers from catastrophic forgetting when new concepts arrive sequentially. We propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in cross attention layers of the popular Stable Diffusion model. We show that C-LoRA not only outperforms several baselines for our proposed setting of text-to-image continual customization, but that we achieve a new state-of-the-art in the well-established rehearsal-free continual learning setting for image classification.
arXiv Detail & Related papers (2023-04-12T17:59:41Z)
Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization. We employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain. Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.