T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2302.08453v2
- Date: Mon, 20 Mar 2023 10:52:26 GMT
- Authors: Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang
Qi, Ying Shan, Xiaohu Qie
- Abstract summary: We learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals.
Our T2I-Adapter has promising generation quality and a wide range of applications.
- Score: 29.280739915676737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The remarkable generative ability of large-scale text-to-image (T2I)
models demonstrates a strong capacity for learning complex structures and
meaningful semantics. However, relying solely on text prompts cannot fully
exploit the knowledge the model has learned, especially when flexible and
accurate control (e.g., over color and structure) is needed. In this paper, we
aim to "dig out" the capabilities that T2I models have implicitly learned, and
then explicitly use them to control generation at a finer granularity. Specifically, we
propose to learn simple and lightweight T2I-Adapters to align internal
knowledge in T2I models with external control signals, while freezing the
original large T2I models. In this way, we can train various adapters according
to different conditions, achieving rich control and editing effects in the
color and structure of the generation results. Further, the proposed
T2I-Adapters have attractive properties of practical value, such as
composability and generalization ability. Extensive experiments demonstrate
that our T2I-Adapter has promising generation quality and a wide range of
applications.
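The adapter idea described in the abstract can be sketched in a few lines: a small convolutional network maps an external condition (e.g., a sketch or depth map) to multi-scale features, which are added to the frozen U-Net encoder's features at matching resolutions. This is a minimal illustration, not the paper's exact architecture; the class name, channel sizes, and block structure are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class T2IAdapter(nn.Module):
    """Illustrative lightweight adapter: maps a spatial condition map to
    multi-scale features that would be ADDED to the frozen T2I U-Net's
    encoder features. Channel sizes are hypothetical, not the paper's."""
    def __init__(self, cond_channels=1, channels=(64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(cond_channels, channels[0], 3, padding=1)
        blocks, in_ch = [], channels[0]
        for out_ch in channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),  # downsample 2x
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, cond):
        feats, x = [], self.stem(cond)
        for block in self.blocks:
            x = block(x)
            feats.append(x)  # one feature map per encoder scale
        return feats

adapter = T2IAdapter()
feats = adapter(torch.randn(1, 1, 64, 64))
print([tuple(f.shape) for f in feats])
# [(1, 64, 32, 32), (1, 128, 16, 16), (1, 256, 8, 8)]
```

Because the base T2I model stays frozen, several such adapters (one per condition type) can be trained independently and, as the abstract notes, composed at inference time by summing their feature maps.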
Related papers
- ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement [49.513401043490305]
This work explores the continual general pre-training of text-to-video models.
We break this task into two key aspects: increasing model capacity and improving semantic understanding.
For semantic understanding, we propose a method that leverages large language models as advanced text encoders.
arXiv Detail & Related papers (2024-12-25T18:58:07Z)
- PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation [4.98706730396778]
We present PhyT2V, a new data-independent T2V technique that expands the current T2V model's capability of video generation to out-of-distribution domains.
Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves 35% improvement compared to T2V prompt enhancers.
arXiv Detail & Related papers (2024-11-30T22:02:12Z)
- TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On [78.33688031340698]
TED-VITON is a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features.
These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity.
arXiv Detail & Related papers (2024-11-26T01:00:09Z)
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image [45.34977005820166]
NVS-Adapter is a plug-and-play module for a Text-to-Image (T2I) model.
It synthesizes novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models.
Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views.
arXiv Detail & Related papers (2023-12-12T14:29:57Z)
- Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation [115.63085345822175]
We introduce "Idea to Image", a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.
We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities.
arXiv Detail & Related papers (2023-10-12T17:34:20Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effectively communicating with T2I models, such as Stable Diffusion, through natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems' efforts to align with human intent and introduce a new task: interactive text-to-image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA could also be utilized in one-shot video editing with only 2 minutes tuning.
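The parameter-efficient recipe SimDA describes (training ~24M of 1.1B parameters) amounts to freezing the backbone and optimizing only the adapter. A minimal sketch, with a hypothetical tiny stand-in for the pretrained backbone:

```python
import torch.nn as nn

# Hypothetical stand-in for a large pretrained T2I backbone (frozen).
base = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
adapter = nn.Linear(512, 512)  # small trainable adapter

for p in base.parameters():
    p.requires_grad = False  # freeze the backbone; only the adapter trains

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
total = trainable + sum(p.numel() for p in base.parameters())
print(f"training {trainable}/{total} parameters")
# training 262656/787968 parameters
```

An optimizer would then be constructed over `adapter.parameters()` alone, so gradient memory and checkpoint size scale with the adapter rather than the backbone.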
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
- Transformer-based Conditional Variational Autoencoder for Controllable Story Generation [39.577220559911055]
We investigate large-scale latent variable models (LVMs) for neural story generation along two threads: generation effectiveness and controllability.
We advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers.
Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE).
arXiv Detail & Related papers (2021-01-04T08:31:11Z)
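The CVAE component above rests on the standard reparameterized latent head. A toy sketch (sizes and class name are illustrative, not the paper's configuration): an encoder state yields a mean and log-variance, a latent vector is sampled differentiably, and a KL term regularizes it toward the prior before injection into the Transformer decoder.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Toy CVAE latent head: produces mu/log-var from an encoder state,
    samples z via the reparameterization trick, and returns the KL
    penalty to the standard-normal prior. Sizes are hypothetical."""
    def __init__(self, hidden=256, latent=32):
        super().__init__()
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization: z = mu + sigma * eps keeps sampling differentiable
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to N(0, I), the CVAE regularizer (non-negative)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl

head = LatentHead()
z, kl = head(torch.randn(4, 256))
print(z.shape, kl.shape)  # torch.Size([4, 32]) torch.Size([4])
```

During training, the KL term is added to the reconstruction loss; at generation time, varying `z` is what gives the controllability the summary refers to.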
This list is automatically generated from the titles and abstracts of the papers in this site.