GEN-VLKT: Simplify Association and Enhance Interaction Understanding for
HOI Detection
- URL: http://arxiv.org/abs/2203.13954v1
- Date: Sat, 26 Mar 2022 01:04:13 GMT
- Title: GEN-VLKT: Simplify Association and Enhance Interaction Understanding for
HOI Detection
- Authors: Yue Liao, Aixi Zhang, Miao Lu, Yongliang Wang, Xiaobo Li, Si Liu
- Abstract summary: For human-object association, previous two-branch methods suffer from complex and costly post-matching.
For interaction understanding, previous methods struggle with the long-tailed distribution and zero-shot discovery.
We propose the Guided-Embedding Network (GEN) to attain a two-branch pipeline without post-matching.
- Score: 17.92210977820113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Human-Object Interaction (HOI) detection can be divided into
two core problems, i.e., human-object association and interaction
understanding. In this paper, we reveal and address the disadvantages of
conventional query-driven HOI detectors in both aspects. For the
association, previous two-branch methods suffer from complex and costly
post-matching, while single-branch methods ignore the distinct features
required by different tasks. We propose the Guided-Embedding Network (GEN) to attain a
two-branch pipeline without post-matching. In GEN, we design an instance
decoder to detect humans and objects with two independent query sets and a
position Guided Embedding (p-GE) to mark the human and object at the same
position as a pair. In addition, we design an interaction decoder to classify
interactions, where the interaction queries are composed of instance Guided
Embeddings (i-GE) generated from the outputs of each instance decoder layer.
For interaction understanding, previous methods suffer from the long-tailed
distribution and zero-shot discovery. This paper proposes a Visual-Linguistic
Knowledge Transfer (VLKT) training strategy to enhance interaction
understanding by transferring knowledge from the visual-linguistic pre-trained
model CLIP. Specifically, we extract text embeddings for all labels with CLIP to
initialize the classifier and adopt a mimic loss to minimize the visual feature
distance between GEN and CLIP. As a result, GEN-VLKT outperforms the state of
the art by large margins on multiple datasets, e.g., +5.05 mAP on HICO-Det. The
source code is available at https://github.com/YueLiao/gen-vlkt.
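The guided-embedding design described in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch-style sketch, not the authors' implementation: the module name, query count, embedding size, and the i-GE fusion (a simple average of the paired instance features) are all assumptions.

```python
import torch
import torch.nn as nn


class GuidedEmbeddingQueries(nn.Module):
    """Sketch of the p-GE / i-GE idea; names, sizes, and fusion are assumptions."""

    def __init__(self, num_queries: int = 64, dim: int = 256):
        super().__init__()
        self.human_queries = nn.Embedding(num_queries, dim)   # human query set
        self.object_queries = nn.Embedding(num_queries, dim)  # object query set
        self.p_ge = nn.Embedding(num_queries, dim)            # position Guided Embedding

    def instance_queries(self):
        # Adding the same p-GE to the human and object query at each index marks
        # them as a pair, so no post-matching step is needed after decoding.
        human_q = self.human_queries.weight + self.p_ge.weight
        object_q = self.object_queries.weight + self.p_ge.weight
        return human_q, object_q

    @staticmethod
    def interaction_queries(human_feats: torch.Tensor,
                            object_feats: torch.Tensor) -> torch.Tensor:
        # i-GE: interaction queries are generated from the outputs of each
        # instance decoder layer; a plain average of the paired human/object
        # features stands in for the paper's fusion here.
        return (human_feats + object_feats) / 2


queries = GuidedEmbeddingQueries()
human_q, object_q = queries.instance_queries()  # each [64, 256], fed to the instance decoder
```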
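The VLKT strategy (CLIP text embeddings used to initialize the classifier, plus a mimic loss between GEN and CLIP visual features) can be sketched in a similar spirit. This assumes the OpenAI clip package; the label list, prompt template, pooled-feature input, and the choice of an L1 mimic loss are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# 1) Classifier initialization from CLIP text embeddings.
#    The label list and prompt template are placeholders, not the paper's exact ones.
hoi_labels = ["ride bicycle", "hold umbrella", "feed horse"]
tokens = clip.tokenize([f"a photo of a person {label}" for label in hoi_labels]).to(device)
with torch.no_grad():
    text_emb = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)  # [C, D]

classifier = nn.Linear(text_emb.shape[1], len(hoi_labels), bias=False)
classifier.weight.data.copy_(text_emb)  # classifier weights initialized with label embeddings

# 2) Mimic loss: pull the detector's pooled visual feature toward CLIP's image
#    feature for the same (CLIP-preprocessed) image batch. `gen_visual_feat`
#    stands in for the feature produced by the HOI detector; L1 is an assumed choice.
def mimic_loss(gen_visual_feat: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        clip_feat = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    return F.l1_loss(F.normalize(gen_visual_feat, dim=-1), clip_feat)
```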
Related papers
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z) - Disentangled Pre-training for Human-Object Interaction Detection [22.653500926559833]
We propose an efficient disentangled pre-training method for HOI detection (DP-HOI)
DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers.
It significantly enhances the performance of existing HOI detection models on a broad range of rare categories.
arXiv Detail & Related papers (2024-04-02T08:21:16Z) - HOICLIP: Efficient Knowledge Transfer for HOI Detection with
Vision-Language Models [30.279621764192843]
Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions.
Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors.
We propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization.
arXiv Detail & Related papers (2023-03-28T07:54:54Z) - Human-Object Interaction Detection via Disentangled Transformer [63.46358684341105]
We present Disentangled Transformer, where both encoder and decoder are disentangled to facilitate learning of two sub-tasks.
Our method outperforms prior work on two public HOI benchmarks by a sizeable margin.
arXiv Detail & Related papers (2022-04-20T08:15:04Z) - The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z) - HOTR: End-to-End Human-Object Interaction Detection with Transformers [26.664864824357164]
We present a novel framework, referred to as HOTR, which directly predicts a set of <human, object, interaction> triplets from an image.
Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
arXiv Detail & Related papers (2021-04-28T10:10:29Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space, and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z) - GID-Net: Detecting Human-Object Interaction with Global and Instance
Dependency [67.95192190179975]
We introduce a two-stage trainable reasoning mechanism, referred to as GID block.
GID-Net is a human-object interaction detection framework consisting of a human branch, an object branch and an interaction branch.
We have compared our proposed GID-Net with existing state-of-the-art methods on two public benchmarks, including V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-11T11:58:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.