Related papers: Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

URL: http://arxiv.org/abs/2410.20155v1
Date: Sat, 26 Oct 2024 12:00:33 GMT
Title: Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
Authors: Liulei Li, Wenguan Wang, Yi Yang,
Abstract summary: We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models. We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions.
Score: 65.82564074712836
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prevalent human-object interaction (HOI) detection approaches typically leverage large-scale visual-linguistic models to help recognize events involving humans and objects. Though promising, models trained via contrastive learning on text-image pairs often neglect mid/low-level visual cues and struggle at compositional reasoning. In response, we introduce DIFFUSIONHOI, a new HOI detector shedding light on text-to-image diffusion models. Unlike the aforementioned models, diffusion models excel in discerning mid/low-level visual concepts as generative models, and possess strong compositionality to handle novel concepts expressed in text inputs. Considering diffusion models usually emphasize instance objects, we first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space. These learned relation embeddings then serve as textual prompts, to steer diffusion models generate images that depict specific interactions, and extract HOI-relevant cues from images without heavy fine-tuning. Benefited from above, DIFFUSIONHOI achieves SOTA performance on three datasets under both regular and zero-shot setups.

Related papers

Evaluating the encoding competence of visual language models using uncommon actions [5.816389980109022]
UAIT is a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes.<n>We synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation.<n>We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning.
arXiv Detail & Related papers (2026-01-12T17:15:45Z)
VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness [5.542712070598464]
VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images.
arXiv Detail & Related papers (2025-03-20T17:56:20Z)
DiffusionPID: Interpreting Diffusion via Partial Information Decomposition [24.83767778658948]
We apply information-theoretic principles to decompose the input text prompt into its elementary components. We analyze how individual tokens and their interactions shape the generated image. We show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
arXiv Detail & Related papers (2024-06-07T18:17:17Z)
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process. We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt. Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z)
On the Multi-modal Vulnerability of Diffusion Models [56.08923332178462]
We propose MMP-Attack to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt. Our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object.
arXiv Detail & Related papers (2024-02-02T12:39:49Z)
Semantic Guidance Tuning for Text-To-Image Diffusion Models [3.3881449308956726]
We propose a training-free approach that modulates the guidance direction of diffusion models during inference. We first decompose the prompt semantics into a set of concepts, and monitor the guidance trajectory in relation to each concept. Based on this observation, we devise a technique to steer the guidance direction towards any concept from which the model diverges.
arXiv Detail & Related papers (2023-12-26T09:02:17Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. We show that VPD can be faster adapted to downstream visual perception tasks using the proposed VPD.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
Compositional Visual Generation with Composable Diffusion Models [80.75258849913574]
We propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. The proposed method can generate scenes at test time that are substantially more complex than those seen in training.
arXiv Detail & Related papers (2022-06-03T17:47:04Z)
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training. Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.