Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
- URL: http://arxiv.org/abs/2503.11937v2
- Date: Tue, 01 Apr 2025 13:42:51 GMT
- Title: Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder
- Authors: Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang
- Abstract summary: We introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
- Score: 11.392007197036525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
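To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released code: the module names, dimensions, and token-based attribute injection are assumptions. It shows a CVAE-style encoder that maps continuous attribute values to latent tokens (the decoder and the KL/reconstruction objectives used in training are omitted) and a decoupled cross-attention layer in which image latents attend to text tokens and attribute tokens through separate branches.

```python
import torch
import torch.nn as nn


class AttributeCVAEEncoder(nn.Module):
    """Hypothetical CVAE-style encoder: maps continuous attribute values
    (e.g., eye openness, car width) to latent attribute tokens.
    Only the encoder/reparameterization is shown; the decoder and the
    KL + reconstruction losses used for training are omitted."""

    def __init__(self, n_attrs: int, latent_dim: int = 64, n_tokens: int = 4, dim: int = 768):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(n_attrs, 256), nn.SiLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.to_tokens = nn.Linear(latent_dim, n_tokens * dim)
        self.n_tokens, self.dim = n_tokens, dim

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        h = self.backbone(attrs)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.to_tokens(z).view(-1, self.n_tokens, self.dim)


class DecoupledCrossAttention(nn.Module):
    """Two cross-attention branches whose outputs are summed: one attends to
    text tokens, the other to attribute tokens produced by the CVAE encoder."""

    def __init__(self, dim: int = 768, heads: int = 8, attr_scale: float = 1.0):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attr_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attr_scale = attr_scale  # strength of the attribute branch

    def forward(self, x, text_tokens, attr_tokens):
        out_text, _ = self.text_attn(x, text_tokens, text_tokens)
        out_attr, _ = self.attr_attn(x, attr_tokens, attr_tokens)
        return x + out_text + self.attr_scale * out_attr


# Usage sketch: image latents attend to both text and attribute tokens.
enc = AttributeCVAEEncoder(n_attrs=3)
xattn = DecoupledCrossAttention()
latents = torch.randn(2, 64, 768)                         # (batch, spatial tokens, dim)
text = torch.randn(2, 77, 768)                            # frozen text-encoder output
attrs = torch.tensor([[0.2, 0.8, 0.5], [0.9, 0.1, 0.3]])  # continuous attribute values
out = xattn(latents, text, enc(attrs))                    # (2, 64, 768)
```

At inference time, scaling the attribute branch (the assumed `attr_scale` parameter) would be one plausible way to trade off text adherence against attribute strength.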
Related papers
- MV-Adapter: Multi-view Consistent Image Generation Made Easy [60.93957644923608]
Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image models.
We present the first adapter-based solution for multi-view image generation: MV-Adapter, a versatile plug-and-play adapter.
arXiv Detail & Related papers (2024-12-04T18:48:20Z)
- DynamicControl: Adaptive Condition Selection for Improved Text-to-Image Generation [63.63429658282696]
We propose DynamicControl, which supports dynamic combinations of diverse control signals.
We show that DynamicControl is superior to existing methods in terms of controllability, generation quality and composability under various conditional controls.
arXiv Detail & Related papers (2024-12-04T11:54:57Z)
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. OminiControl addresses the limitations of existing approaches through three key innovations.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
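As an illustration of the idea, here is a toy masked cross-attention function in PyTorch; the mask construction from a dependency parse is assumed and greatly simplified relative to the paper's FCA.

```python
import torch


def focused_cross_attention(q, k, v, bind_mask):
    """Toy masked cross-attention. Image-patch queries q attend to text-token
    keys/values k, v, but a binary bind_mask (derived, e.g., from a dependency
    parse) blocks attention between an attribute token and patches that do not
    belong to its syntactically bound object. Shapes: q (B, N, d), k and v
    (B, T, d), bind_mask (B, N, T) in {0, 1}; every query must keep at least
    one unmasked token."""
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bnd,btd->bnt", q, k) * scale
    scores = scores.masked_fill(bind_mask == 0, float("-inf"))
    return torch.einsum("bnt,btd->bnd", scores.softmax(dim=-1), v)


# Tiny example with random tensors and a hand-made mask.
B, N, T, d = 1, 16, 6, 64
q, k, v = torch.randn(B, N, d), torch.randn(B, T, d), torch.randn(B, T, d)
mask = torch.ones(B, N, T)
mask[:, :8, 3] = 0  # keep, say, an adjective token away from half of the patches
out = focused_cross_attention(q, k, v, mask)  # (1, 16, 64)
```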
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions [20.351245266660378]
Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. Providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. No current approach offers both continuous and subject-specific control simultaneously, resulting in a gap when trying to achieve precise continuous, subject-specific attribute modulation.
arXiv Detail & Related papers (2024-03-25T18:00:42Z)
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, full-day object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
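A minimal bottleneck-adapter sketch in PyTorch, purely illustrative: the dimensions and the residual injection scheme are assumptions, not the paper's actual bi-directional adapter.

```python
import torch
import torch.nn as nn


class LightFeatureAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add.
    It injects features from one modality (e.g., thermal) into the stream of the
    other (e.g., RGB) with very few trainable parameters."""

    def __init__(self, dim: int = 768, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start with zero injection, so the frozen
        nn.init.zeros_(self.up.bias)    # backbone behaviour is preserved initially

    def forward(self, feat_src: torch.Tensor, feat_dst: torch.Tensor) -> torch.Tensor:
        return feat_dst + self.up(torch.relu(self.down(feat_src)))


rgb = torch.randn(2, 196, 768)   # RGB token features from a frozen backbone
tir = torch.randn(2, 196, 768)   # thermal token features
adapter = LightFeatureAdapter()
fused_rgb = adapter(tir, rgb)    # thermal-to-RGB direction; a second adapter
                                 # would handle the RGB-to-thermal direction
```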
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Adapter-TST: A Parameter Efficient Method for Multiple-Attribute Text Style Transfer [29.67331801326995]
Adapter-TST is a framework that freezes the pre-trained model's original parameters and enables the development of a multiple-attribute text style transfer model.
We evaluate the proposed model on both traditional sentiment transfer and multiple-attribute transfer tasks.
arXiv Detail & Related papers (2023-05-10T07:33:36Z)
- Progressive Open-Domain Response Generation with Multiple Controllable Attributes [13.599621571488033]
We propose a Progressively trained Hierarchical Vari-Decoder (PHED) to tackle this task.
PHED deploys a Conditional Variational AutoEncoder (CVAE) on a Transformer to include one aspect of the attributes at each stage.
PHED significantly outperforms the state-of-the-art neural generation models and produces more diverse responses as expected.
arXiv Detail & Related papers (2021-06-07T08:48:39Z)
- MU-GAN: Facial Attribute Editing based on Multi-attention Mechanism [12.762892831902349]
We propose a Multi-attention U-Net-based Generative Adversarial Network (MU-GAN).
First, we replace the classic convolutional encoder-decoder with a symmetric U-Net-like structure in the generator.
Second, a self-attention mechanism is incorporated into convolutional layers for modeling long-range and multi-level dependencies.
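A compact PyTorch sketch of self-attention over a convolutional feature map, in the spirit of SAGAN-style attention; the channel sizes and the learned residual gate are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn


class ConvSelfAttention(nn.Module):
    """Self-attention over a conv feature map: every spatial location can attend
    to every other location, complementing the local receptive field of convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual gate, starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)               # (b, hw, c//8)
        k = self.k(x).flatten(2)                               # (b, c//8, hw)
        v = self.v(x).flatten(2)                               # (b, c, hw)
        attn = torch.softmax(q @ k / (c // 8) ** 0.5, dim=-1)  # (b, hw, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out


feat = torch.randn(2, 64, 32, 32)
print(ConvSelfAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```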
arXiv Detail & Related papers (2020-09-09T09:25:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.