Related papers: InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

URL: http://arxiv.org/abs/2312.05849v2
Date: Tue, 27 Feb 2024 02:00:58 GMT
Title: InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
Authors: Jiun Tian Hoe and Xudong Jiang and Chee Seng Chan and Yap-Peng Tan and Weipeng Hu
Abstract summary: We study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information. We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models.
Score: 43.62338454684645
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale text-to-image (T2I) diffusion models have showcased incredible capabilities in generating coherent images based on textual descriptions, enabling vast applications in content generation. While recent advancements have introduced control over factors such as object localization, posture, and image contours, a crucial gap remains in our ability to control the interactions between objects in the generated content. Well-controlling interactions in generated images could yield meaningful applications, such as creating realistic scenes with interacting characters. In this work, we study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information, consisting of a triplet label (person, action, object) and corresponding bounding boxes. We propose a pluggable interaction control model, called InteractDiffusion that extends existing pre-trained T2I diffusion models to enable them being better conditioned on interactions. Specifically, we tokenize the HOI information and learn their relationships via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens, thereby conditioning the visual tokens better in existing T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models, which outperforms existing baselines by a large margin in HOI detection score, as well as fidelity in FID and KID. Project page: https://jiuntian.github.io/interactdiffusion.

Related papers

SDMatte: Grafting Diffusion Models for Interactive Matting [16.575733536011658]
We propose a diffusion-driven interactive matting model, SDMatte, with three key contributions.<n>First, we exploit the powerful priors of diffusion models and transform the text-driven interaction capability into visual prompt-driven interaction capability.<n>Second, we integrate coordinate embeddings of visual prompts and opacity embeddings of target objects into U-Net, enhancing SDMatte's sensitivity to spatial position information.<n>Third, we propose a masked self-attention mechanism that enables the model to focus on areas specified by visual prompts, leading to better performance.
arXiv Detail & Related papers (2025-08-01T09:00:48Z)
Generating Fine Details of Entity Interactions [17.130839907951877]
This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios. We propose a decomposition-augmented refinement procedure to address interaction generation challenges. Our approach, DetailScribe, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process in refinement.
arXiv Detail & Related papers (2025-04-11T17:24:58Z)
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness [5.542712070598464]
VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images.
arXiv Detail & Related papers (2025-03-20T17:56:20Z)
Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image. We frame the task as a spatially-conditioned inpainting problem, where the target image is in-painted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
arXiv Detail & Related papers (2024-12-19T05:02:30Z)
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
DreamMover: Leveraging the Prior of Diffusion Models for Image Interpolation with Large Motion [35.60459492849359]
We study the problem of generating intermediate images from image pairs with large motion. Due to the large motion, the intermediate semantic information may be absent in input images. We propose DreamMover, a novel image framework with three main components.
arXiv Detail & Related papers (2024-09-15T04:09:12Z)
Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image. We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence. We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
AID: Attention Interpolation of Text-to-Image Diffusion [64.87754163416241]
We introduce a training-free technique named Attention Interpolation via Diffusion (AID) AID fuses the interpolated attention with self-attention to boost fidelity. We also present a variant, Conditional-guided Attention Interpolation via Diffusion (AID), that considers as a condition-dependent generative process.
arXiv Detail & Related papers (2024-03-26T17:57:05Z)
Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation. We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z)
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis [59.10787643285506]
Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. One critical limitation of these models is the low fidelity of generated images with respect to the text description. We propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models.
arXiv Detail & Related papers (2023-04-07T23:49:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.