InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.05849v2
- Date: Tue, 27 Feb 2024 02:00:58 GMT
- Title: InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- Authors: Jiun Tian Hoe and Xudong Jiang and Chee Seng Chan and Yap-Peng Tan and
Weipeng Hu
- Abstract summary: We study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information.
We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models.
Our model attains the ability to control the interaction and location on existing T2I diffusion models.
- Score: 43.62338454684645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image (T2I) diffusion models have showcased incredible
capabilities in generating coherent images based on textual descriptions,
enabling vast applications in content generation. While recent advancements
have introduced control over factors such as object localization, posture, and
image contours, a crucial gap remains in our ability to control the
interactions between objects in the generated content. Well-controlling
interactions in generated images could yield meaningful applications, such as
creating realistic scenes with interacting characters. In this work, we study
the problems of conditioning T2I diffusion models with Human-Object Interaction
(HOI) information, consisting of a triplet label (person, action, object) and
corresponding bounding boxes. We propose a pluggable interaction control model,
called InteractDiffusion that extends existing pre-trained T2I diffusion models
to enable them being better conditioned on interactions. Specifically, we
tokenize the HOI information and learn their relationships via interaction
embeddings. A conditioning self-attention layer is trained to map HOI tokens to
visual tokens, thereby conditioning the visual tokens better in existing T2I
diffusion models. Our model attains the ability to control the interaction and
location on existing T2I diffusion models, which outperforms existing baselines
by a large margin in HOI detection score, as well as fidelity in FID and KID.
Project page: https://jiuntian.github.io/interactdiffusion.
Related papers
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? [10.72249123249003]
We revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding.
We introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions.
LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS dataset with 38.2 BLEU@4 and 126.2 CIDEr.
arXiv Detail & Related papers (2024-04-16T17:47:16Z) - AID: Attention Interpolation of Text-to-Image Diffusion [64.87754163416241]
We introduce a training-free technique named Attention Interpolation via Diffusion (AID)
AID fuses the interpolated attention with self-attention to boost fidelity.
We also present a variant, Conditional-guided Attention Interpolation via Diffusion (AID), that considers as a condition-dependent generative process.
arXiv Detail & Related papers (2024-03-26T17:57:05Z) - Box It to Bind It: Unified Layout Control and Attribute Binding in T2I
Diffusion Models [28.278822620442774]
Box-it-to-Bind-it (B2B) is a training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance.
B2B is designed as a compatible plug-and-play module for existing T2I models.
arXiv Detail & Related papers (2024-02-27T21:51:32Z) - Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation.
We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - BLIP-Diffusion: Pre-trained Subject Representation for Controllable
Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z) - Harnessing the Spatial-Temporal Attention of Diffusion Models for
High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks.
One critical limitation of these models is the low fidelity of generated images with respect to the text description.
We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.