Re-mine, Learn and Reason: Exploring the Cross-modal Semantic
Correlations for Language-guided HOI detection
- URL: http://arxiv.org/abs/2307.13529v2
- Date: Mon, 18 Sep 2023 09:28:46 GMT
- Title: Re-mine, Learn and Reason: Exploring the Cross-modal Semantic
Correlations for Language-guided HOI detection
- Authors: Yichao Cao, Qingfei Tang, Feng Yang, Xiu Su, Shan You, Xiaobo Lu and
Chang Xu
- Abstract summary: Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
- Score: 57.13665112065285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection is a challenging computer vision
task that requires visual models to address the complex interactive
relationships between humans and objects and predict HOI triplets. While the
numerous possible interaction combinations pose a challenge, they also create
opportunities for multimodal learning over vision and text. In this paper, we
present a systematic and unified framework (RmLR) that enhances HOI detection
by incorporating structured text knowledge. Firstly, we qualitatively and
quantitatively analyze the loss of interaction information in the two-stage HOI
detector and propose a re-mining strategy to generate more comprehensive visual
representations. Secondly, we design finer-grained sentence- and word-level
alignment and knowledge transfer strategies to effectively address the
many-to-many matching problem between multiple interactions and multiple
texts. These strategies alleviate the matching confusion that arises
when multiple interactions occur simultaneously, thereby improving the
effectiveness of the alignment process. Finally, HOI reasoning by visual
features augmented with textual knowledge substantially improves the
understanding of interactions. Experimental results demonstrate the
effectiveness of our approach, which achieves state-of-the-art performance on
public benchmarks. We further analyze the effects of different components of
our approach to provide insights into its efficacy.
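To make the alignment step concrete, below is a minimal PyTorch sketch of what sentence- and word-level cross-modal alignment can look like. The feature names (hoi_feats, sent_emb, word_emb), the attention pooling, and the symmetric InfoNCE-style loss are illustrative assumptions, not the authors' implementation.

# Minimal sketch of sentence- and word-level cross-modal alignment for HOI
# features. Everything here (names, losses, pooling) is an assumption for
# illustration; it is not the RmLR authors' code.
import torch
import torch.nn.functional as F

def sentence_level_alignment(hoi_feats, sent_emb, temperature=0.07):
    """Contrastively align each visual interaction feature with the embedding
    of its full HOI description (e.g. "a person riding a horse")."""
    v = F.normalize(hoi_feats, dim=-1)      # (N, D) visual interaction features
    t = F.normalize(sent_emb, dim=-1)       # (N, D) paired sentence embeddings
    logits = v @ t.T / temperature          # (N, N) pairwise similarities
    targets = torch.arange(v.size(0))       # i-th feature matches i-th sentence
    # Symmetric InfoNCE: vision-to-text plus text-to-vision
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def word_level_alignment(hoi_feats, word_emb, temperature=0.07):
    """Softly match each interaction against the individual words of its
    description, so verbs and object nouns can be weighted differently."""
    v = F.normalize(hoi_feats, dim=-1)      # (N, D)
    w = F.normalize(word_emb, dim=-1)       # (N, L, D) per-word embeddings
    attn = torch.softmax(torch.einsum('nd,nld->nl', v, w) / temperature, dim=-1)
    pooled = torch.einsum('nl,nld->nd', attn, w)  # attention-pooled text feature
    return (1.0 - F.cosine_similarity(v, pooled, dim=-1)).mean()

# Toy usage with random features
N, L, D = 4, 6, 256
hoi, sent, words = torch.randn(N, D), torch.randn(N, D), torch.randn(N, L, D)
total = sentence_level_alignment(hoi, sent) + word_level_alignment(hoi, words)
print(float(total))

The word-level term lets one visual interaction spread its attention over several candidate words, which is one plausible way to ease the many-to-many matching confusion described above.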
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models [25.070424546200293]
We present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors.
Experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors.
Our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets.
arXiv Detail & Related papers (2024-07-04T03:50:30Z)
- CoSD: Collaborative Stance Detection with Contrastive Heterogeneous Topic Graph Learning [18.75039816544345]
We present a novel collaborative stance detection framework called CoSD.
CoSD learns topic-aware semantics and collaborative signals among texts, topics, and stance labels.
Experiments on two benchmark datasets demonstrate the state-of-the-art detection performance of CoSD.
arXiv Detail & Related papers (2024-04-26T02:04:05Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition [14.639340916340801]
We propose a novel Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition (DER-GCN) method.
It models dialogue relations between speakers and captures latent event relation information.
We conduct extensive experiments on the IEMOCAP and MELD benchmark datasets, which verify the effectiveness of the DER-GCN model.
arXiv Detail & Related papers (2023-12-17T01:49:40Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
The long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We propose a transformer-based framework for compositional HOI learning (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
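As a rough illustration of the compositional idea in the entry above, here is a minimal sketch that recombines disentangled verb and object embeddings to synthesize features for unseen triplets. The HOIComposer module and the cross-pairing scheme are assumptions for illustration, not the paper's method.

# Minimal sketch of compositional HOI learning, assuming disentangled verb
# and object embeddings. The recombination scheme below is illustrative,
# not the paper's actual method.
import torch
import torch.nn as nn

class HOIComposer(nn.Module):
    """Compose an interaction representation from independent verb and object
    embeddings, so rare triplets (e.g. "wash elephant") can be synthesized
    from common parts ("wash" + "elephant") seen in other combinations."""
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, verb_emb, obj_emb):
        return self.fuse(torch.cat([verb_emb, obj_emb], dim=-1))

composer = HOIComposer()
verbs = torch.randn(8, 256)  # verb features from real samples
objs = torch.randn(8, 256)   # object features from real samples
# Cross-pair every verb with every object to create novel combinations
novel = composer(verbs.repeat_interleave(8, 0), objs.repeat(8, 1))
print(novel.shape)  # torch.Size([64, 256]) synthesized triplet features

Training on such synthesized combinations is one common way to counter the long-tailed triplet distribution noted above.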
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)