Compositional Learning in Transformer-Based Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2308.05961v1
- Date: Fri, 11 Aug 2023 06:41:20 GMT
- Title: Compositional Learning in Transformer-Based Human-Object Interaction Detection
- Authors: Zikun Zhuang, Ruihao Qian, Chi Xie, Shuang Liang
- Abstract summary: Long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the combinatorial nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We creatively propose a transformer-based framework for compositional HOI learning.
- Score: 6.630793383852106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-object interaction (HOI) detection is an important part of
understanding human activities and visual scenes. The long-tailed distribution
of labeled instances is a primary challenge in HOI detection, promoting
research in few-shot and zero-shot learning. Inspired by the combinatorial
nature of HOI triplets, some existing approaches adopt the idea of
compositional learning, in which object and action features are learned
individually and re-composed as new training samples. However, these methods
follow the CNN-based two-stage paradigm with limited feature extraction
ability, and often rely on auxiliary information for better performance.
Without introducing any additional information, we creatively propose a
transformer-based framework for compositional HOI learning. Human-object pair
representations and interaction representations are re-composed across
different HOI instances, which involves richer contextual information and
promotes the generalization of knowledge. Experiments show our simple but
effective method achieves state-of-the-art performance, especially on rare HOI
classes.
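As a rough illustration of the compositional idea the abstract describes (re-composing object/pair features and interaction features across different HOI instances to synthesize new training samples), here is a minimal hypothetical Python sketch. The feature vectors, instance dicts, and the `recompose` helper are illustrative assumptions, not the paper's actual implementation.

```python
import itertools

# Hypothetical HOI instances: each holds a human-object pair representation
# and an interaction (verb) label. Values are placeholder features, not
# real model outputs.
hoi_instances = [
    {"pair": [0.1, 0.9], "verb": "ride", "object": "horse"},
    {"pair": [0.7, 0.2], "verb": "feed", "object": "dog"},
]

def recompose(instances):
    """Cross-combine each instance's pair representation with the
    interaction label of a *different* instance, synthesizing new
    (possibly rare or unseen) HOI training samples."""
    composed = []
    for a, b in itertools.permutations(instances, 2):
        composed.append({"pair": a["pair"], "verb": b["verb"], "object": a["object"]})
    return composed

new_samples = recompose(hoi_instances)
# Two instances yield two cross-composed samples, e.g. the first pairs
# instance 0's human-object features with instance 1's verb ("feed" + "horse").
```

In the paper's transformer setting the re-composition happens on learned query representations rather than on labels, but the combinatorial principle is the same: rare verb-object combinations can be trained on without ever being observed together.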
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset that includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs).
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
- Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task.
We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Detecting Human-Object Interaction via Fabricated Compositional Learning [106.37536031160282]
Human-Object Interaction (HOI) detection is a fundamental task for high-level scene understanding.
Humans have an extremely powerful compositional perception ability to recognize rare or unseen HOI samples.
We propose Fabricated Compositional Learning (FCL) to address the problem of open long-tailed HOI detection.
arXiv Detail & Related papers (2021-03-15T08:52:56Z)
- Transferable Interactiveness Knowledge for Human-Object Interaction Detection [46.89715038756862]
We explore interactiveness knowledge which indicates whether a human and an object interact with each other or not.
We found that interactiveness knowledge can be learned across HOI datasets and bridge the gap between diverse HOI category settings.
Our core idea is to exploit an interactiveness network to learn the general interactiveness knowledge from multiple HOI datasets.
arXiv Detail & Related papers (2021-01-25T18:21:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.