Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
- URL: http://arxiv.org/abs/2410.20155v1
- Date: Sat, 26 Oct 2024 12:00:33 GMT
- Title: Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
- Authors: Liulei Li, Wenguan Wang, Yi Yang,
- Abstract summary: We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
- Score: 65.82564074712836
- License:
- Abstract: Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle at compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions, and extract HOI-relevant cues from images without heavy fine-tuning. Benefited from above, DIFFUSIONHOI achieves SOTA performance on three datasets under both regular and zero-shot setups.
Related papers
- DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components.
We analyze how individual tokens and their interactions shape the generated image.
We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z) - Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z) - On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt.
Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z) - Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be faster adapted to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models.
An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image.
The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.