InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2312.05849v2
- Date: Tue, 27 Feb 2024 02:00:58 GMT
- Title: InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
- Authors: Jiun Tian Hoe and Xudong Jiang and Chee Seng Chan and Yap-Peng Tan and
Weipeng Hu
- Abstract summary: We study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information.
We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models.
Our model adds the ability to control interaction and location to existing T2I diffusion models.
- Score: 43.62338454684645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale text-to-image (T2I) diffusion models have showcased incredible
capabilities in generating coherent images based on textual descriptions,
enabling vast applications in content generation. While recent advancements
have introduced control over factors such as object localization, posture, and
image contours, a crucial gap remains in our ability to control the
interactions between objects in the generated content. Controlling such
interactions well in generated images could yield meaningful applications, such as
creating realistic scenes with interacting characters. In this work, we study
the problems of conditioning T2I diffusion models with Human-Object Interaction
(HOI) information, consisting of a triplet label (person, action, object) and
corresponding bounding boxes. We propose a pluggable interaction control model,
called InteractDiffusion, that extends existing pre-trained T2I diffusion models
to be better conditioned on interactions. Specifically, we
tokenize the HOI information and learn their relationships via interaction
embeddings. A conditioning self-attention layer is trained to map HOI tokens to
visual tokens, thereby conditioning the visual tokens better in existing T2I
diffusion models. Our model adds the ability to control interaction and
location to existing T2I diffusion models, and it outperforms existing baselines
by a large margin in HOI detection score as well as in fidelity (FID and KID).
Project page: https://jiuntian.github.io/interactdiffusion.
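To make the conditioning mechanism concrete, the following is a minimal PyTorch sketch of one way such HOI conditioning can be wired up: the (person, action, object) embeddings and their boxes are turned into interaction tokens, and a zero-initialized gated self-attention layer lets the visual tokens attend to them without disturbing the pretrained T2I weights. All class names, dimensions, and the Fourier box encoding below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of HOI conditioning with a gated self-attention layer.
# Names, dimensions and the Fourier box encoding are illustrative only.
import torch
import torch.nn as nn


def fourier_box_embedding(boxes: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Encode (B, N, 4) normalized boxes with sine/cosine Fourier features."""
    freqs = 2.0 ** torch.arange(num_freqs, device=boxes.device) * torch.pi
    x = boxes.unsqueeze(-1) * freqs                              # (B, N, 4, F)
    return torch.cat([x.sin(), x.cos()], dim=-1).flatten(-2)     # (B, N, 8F)


class HOITokenizer(nn.Module):
    """Turn (person, action, object) text embeddings plus boxes into HOI tokens."""

    def __init__(self, text_dim: int, hidden_dim: int, num_freqs: int = 8):
        super().__init__()
        box_dim = 4 * 2 * num_freqs
        self.subject_mlp = nn.Linear(text_dim + box_dim, hidden_dim)
        self.object_mlp = nn.Linear(text_dim + box_dim, hidden_dim)
        self.action_mlp = nn.Linear(text_dim, hidden_dim)

    def forward(self, subj_emb, act_emb, obj_emb, subj_box, obj_box):
        s = self.subject_mlp(torch.cat([subj_emb, fourier_box_embedding(subj_box)], -1))
        o = self.object_mlp(torch.cat([obj_emb, fourier_box_embedding(obj_box)], -1))
        a = self.action_mlp(act_emb)
        return torch.cat([s, a, o], dim=1)                       # (B, 3N, hidden_dim)


class GatedInteractionAttention(nn.Module):
    """Self-attention over [visual tokens; HOI tokens], gated so that the
    pretrained T2I behaviour is untouched at initialization."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))                 # gate starts closed

    def forward(self, visual_tokens, hoi_tokens):
        x = torch.cat([visual_tokens, hoi_tokens], dim=1)
        out, _ = self.attn(x, x, x)
        n = visual_tokens.shape[1]
        # Only the visual positions are updated; HOI tokens act as context.
        return visual_tokens + torch.tanh(self.gate) * out[:, :n]
```

Because the gate starts at zero, the wrapped block initially behaves exactly like the base T2I model, and the interaction pathway is learned on top of the frozen backbone.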
Related papers
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
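One plausible reading of relation embeddings serving as textual prompts is a textual-inversion-style pseudo-token: a learned vector spliced into the prompt's token embeddings at a placeholder position before the text encoder runs. The sketch below illustrates that reading; the class name, placeholder mechanism, and encoder interface are assumptions, not the paper's actual code.

```python
# Illustrative: steer a T2I model with a learned relation embedding used as
# a pseudo-token in the prompt (textual-inversion style). The encoder is a
# hypothetical frozen transformer body that accepts token embeddings.
import torch
import torch.nn as nn


class RelationPromptEncoder(nn.Module):
    """Splice a learned relation embedding into a prompt's token embeddings."""

    def __init__(self, token_embedding: nn.Embedding, encoder: nn.Module, placeholder_id: int):
        super().__init__()
        self.token_embedding = token_embedding      # frozen vocabulary embedding
        self.encoder = encoder                      # frozen transformer body
        self.placeholder_id = placeholder_id        # token id reserved for the relation
        # One learnable vector standing in for a relation such as "riding".
        self.relation_embedding = nn.Parameter(
            torch.randn(token_embedding.embedding_dim) * 0.02
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        embs = self.token_embedding(input_ids)                   # (B, T, D)
        mask = (input_ids == self.placeholder_id).unsqueeze(-1)  # (B, T, 1)
        # Replace every placeholder position with the learned relation vector.
        embs = torch.where(mask, self.relation_embedding, embs)
        return self.encoder(embs)   # conditioning sequence handed to the denoiser
```

Only the relation vector needs training; everything else can stay frozen, which is what makes this kind of prompt-level steering cheap.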
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion [35.60459492849359]
We study the problem of generating intermediate images from image pairs with large motion.
Due to the large motion, the intermediate semantic information may be absent in input images.
We propose DreamMover, a novel image interpolation framework with three main components.
arXiv Detail & Related papers (2024-09-15T04:09:12Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
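A rough sketch of what a focused cross-attention step could look like follows: attribute tokens are only allowed to attend to image patches where their syntactically bound noun token attends strongly. The hard thresholding rule and the attr_to_noun mapping (e.g. from a dependency parse) are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative focused cross-attention: attribute tokens are masked to the
# patches where their bound noun token attends. Thresholding is a placeholder.
import torch


def focused_cross_attention(q, k, v, attr_to_noun, keep_ratio=0.3):
    """
    q: (B, P, D) image-patch queries; k, v: (B, T, D) text-token keys/values.
    attr_to_noun: dict {attribute token index: bound noun token index},
                  e.g. derived from a dependency parse of the prompt.
    """
    d = q.shape[-1]
    scores = torch.einsum("bpd,btd->bpt", q, k) / d ** 0.5   # (B, P, T)
    per_token = scores.softmax(dim=1)                        # where each token attends
    mask = torch.ones_like(scores)
    for attr_idx, noun_idx in attr_to_noun.items():
        noun_attn = per_token[:, :, noun_idx]                # (B, P)
        # Keep only the patches the noun attends to most strongly.
        thresh = noun_attn.quantile(1 - keep_ratio, dim=1, keepdim=True)
        mask[:, :, attr_idx] = (noun_attn >= thresh).float()
    scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = scores.softmax(dim=-1)                            # per patch, over tokens
    return torch.einsum("bpt,btd->bpd", attn, v)
```

The effect is that an attribute like "red" can only write into the regions its noun occupies, which is one way to reduce attribute leakage between objects.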
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- AID: Attention Interpolation of Text-to-Image Diffusion [64.87754163416241]
We introduce a training-free technique named Attention Interpolation via Diffusion (AID).
AID fuses the interpolated attention with self-attention to boost fidelity.
We also present a variant, Conditional-guided Attention Interpolation via Diffusion, that considers interpolation as a condition-dependent generative process.
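The summary suggests that attention computed under two prompt conditions is interpolated and then fused with self-attention for fidelity. A minimal sketch of that idea is below; the 50/50 fusion weight and the linear interpolation coefficient are placeholders, not AID's actual design.

```python
# Illustrative attention interpolation between two prompt conditions, fused
# with self-attention. Coefficients and fusion rule are assumptions.
import torch


def attend(q, k, v):
    scores = torch.einsum("bqd,bkd->bqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("bqk,bkd->bqd", scores.softmax(dim=-1), v)


def interpolated_attention(q, k_self, v_self, k_txt1, v_txt1, k_txt2, v_txt2, t):
    """
    q:               (B, P, D) queries from the current latent.
    k_self, v_self:  self-attention keys/values from the same latent.
    k_txt*, v_txt*:  (B, T, D) keys/values from the two prompt embeddings.
    t:               interpolation coefficient in [0, 1].
    """
    # Interpolate the two text conditions at the attention-output level.
    cross = (1 - t) * attend(q, k_txt1, v_txt1) + t * attend(q, k_txt2, v_txt2)
    # Fuse with self-attention to keep the interpolated sample coherent.
    return 0.5 * (cross + attend(q, k_self, v_self))
```

Sweeping t from 0 to 1 across a batch of samples would then trace an interpolation path between the two conditions.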
arXiv Detail & Related papers (2024-03-26T17:57:05Z)
- Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation.
We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
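One simple way to realize a face-wise attention loss is to penalize the identity token's cross-attention mass that falls outside a face mask, so the ID embedding cannot absorb layout or background. The sketch below assumes that formulation; the paper's exact loss may differ.

```python
# Illustrative face-wise attention loss: the identity token's cross-attention
# should stay inside the face region. The normalization is an assumption.
import torch


def face_wise_attention_loss(attn_maps: torch.Tensor,
                             face_mask: torch.Tensor,
                             id_token_idx: int) -> torch.Tensor:
    """
    attn_maps: (B, P, T) cross-attention probabilities (patches x text tokens).
    face_mask: (B, P) binary mask, 1 on face patches, 0 elsewhere.
    """
    id_attn = attn_maps[:, :, id_token_idx]                       # (B, P)
    id_attn = id_attn / (id_attn.sum(dim=1, keepdim=True) + 1e-8)
    leaked = id_attn * (1.0 - face_mask)        # attention mass off the face
    return leaked.sum(dim=1).mean()
```

Minimizing such a term during fine-tuning pushes the identity embedding to describe the face itself rather than where the face happens to be.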
arXiv Detail & Related papers (2024-01-31T11:52:33Z)
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
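For context, point-based editing with a diffusion model is commonly framed as optimizing the noisy latent so that denoiser features at each handle point move toward the corresponding target point (motion supervision), alternating with point tracking. Below is a hedged sketch of a single motion-supervision step; the feature hook, neighborhood radius, and step size are assumptions, not DragDiffusion's exact procedure.

```python
# Illustrative motion-supervision step for point-based latent editing.
# feature_fn is a hypothetical differentiable hook returning denoiser features.
import torch
import torch.nn.functional as F


def motion_supervision_step(latent, handle_pts, target_pts, feature_fn, lr=0.01, r=3):
    """
    latent:      (1, C, H, W) noisy latent at the chosen timestep.
    handle_pts:  list of (y, x) handle coordinates in feature-map space.
    target_pts:  list of (y, x) target coordinates, same length.
    feature_fn:  callable returning (1, C_f, H, W) features, differentiable in latent.
    """
    latent = latent.detach().clone().requires_grad_(True)
    feats = feature_fn(latent)
    loss = feats.new_zeros(())
    for (hy, hx), (ty, tx) in zip(handle_pts, target_pts):
        # Unit step direction from handle towards target.
        d = torch.tensor([ty - hy, tx - hx], dtype=torch.float32)
        d = d / (d.norm() + 1e-8)
        ny, nx = int(round(hy + d[0].item())), int(round(hx + d[1].item()))
        # Features one step towards the target should match the (detached)
        # features around the current handle, dragging content along.
        # Assumes all points lie at least r cells away from the border.
        src = feats[:, :, hy - r:hy + r + 1, hx - r:hx + r + 1].detach()
        dst = feats[:, :, ny - r:ny + r + 1, nx - r:nx + r + 1]
        loss = loss + F.l1_loss(dst, src)
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad
    return latent.detach()
```

In a full editing loop this step would alternate with re-locating the handle points on the updated features (point tracking) until they reach their targets.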
arXiv Detail & Related papers (2023-06-26T06:04:09Z)