FreeA: Human-object Interaction Detection using Free Annotation Labels
- URL: http://arxiv.org/abs/2403.01840v1
- Date: Mon, 4 Mar 2024 08:38:15 GMT
- Title: FreeA: Human-object Interaction Detection using Free Annotation Labels
- Authors: Yuxiao Wang, Zhenao Wei, Xinyu Jiang, Yu Lei, Weiying Xue, Jinxiu Liu,
Qi Liu
- Abstract summary: We propose a novel self-adaptive, language-driven HOI detection method, termed FreeA, that requires no manual labeling.
FreeA matches image features of human-object pairs against HOI text templates, and an a priori knowledge-based masking method is developed to suppress improbable interactions.
Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI models.
- Score: 9.537338958326181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent human-object interaction (HOI) detection approaches depend on costly manual effort and comprehensively annotated image datasets. In this paper, we propose a novel self-adaptive, language-driven HOI detection method, termed FreeA, that requires no manual labeling: it leverages the adaptability of CLIP to generate latent HOI labels. Specifically, FreeA matches image features of human-object pairs against HOI text templates, and an a priori knowledge-based masking method is developed to suppress improbable interactions. In addition, FreeA uses the proposed interaction correlation matching method to raise the likelihood of actions related to a specified action, further refining the generated HOI labels. Experiments on two benchmark datasets show that FreeA achieves state-of-the-art performance among weakly supervised HOI models. Our approach localizes and classifies interactive actions more accurately than the newest weakly supervised model by +8.58 mean Average Precision (mAP) on HICO-DET and +1.23 mAP on V-COCO, and than the latest weakly+ supervised model by +1.68 mAP and +7.28 mAP, respectively. Code will be available at https://drliuqi.github.io/.
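As a rough illustration of the label-generation idea described in the abstract, the sketch below scores a human-object pair crop against HOI text templates with CLIP and masks out improbable interactions. This is a minimal sketch, not the authors' implementation: the class list, prompt template, prior_mask construction, and the latent_hoi_scores helper are all hypothetical, and the real method additionally applies interaction correlation matching over the full HICO-DET/V-COCO vocabulary.

```python
# Minimal sketch (not the authors' released code) of FreeA's label-generation idea:
# score a human-object pair crop against HOI text templates with CLIP and
# suppress improbable interactions with an a priori mask.
# Assumes the OpenAI `clip` package: pip install git+https://github.com/openai/CLIP
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical HOI vocabulary; FreeA uses the full HICO-DET / V-COCO verb-object classes.
hoi_classes = [("riding", "bicycle"), ("holding", "bicycle"), ("eating", "pizza")]
prompts = [f"a photo of a person {verb} a {obj}" for verb, obj in hoi_classes]
text_tokens = clip.tokenize(prompts).to(device)

def latent_hoi_scores(pair_crop: Image.Image, prior_mask: torch.Tensor) -> torch.Tensor:
    """Return masked CLIP similarities for one human-object pair crop.

    `prior_mask` is a 0/1 vector marking which HOI classes are plausible for the
    detected object category (a stand-in for the a priori knowledge-based mask).
    """
    image = preprocess(pair_crop).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)         # cosine similarity per HOI prompt
    return sims.softmax(dim=-1) * prior_mask.to(device)  # zero out improbable interactions
```

In this reading, the masked scores for each candidate pair would be thresholded or arg-maxed into pseudo-labels that supervise a standard HOI detector; the sketch covers only the CLIP matching and masking step.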
Related papers
- Robot Instance Segmentation with Few Annotations for Grasping [10.005879464111915]
We propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI). Our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-07-01T13:58:32Z) - Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action
Detection [12.109835641702784]
Spatio-temporal action detection aims to determine the time and place at which each person's action occurs in a video.
Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data.
We propose to utilize a pre-trained visual-language model to extract the representative image and text features.
arXiv Detail & Related papers (2023-04-10T16:08:59Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge
Distillation [86.41437210485932]
We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously.
We propose a novel end-to-end zero-shot HOI Detection framework via vision-language knowledge distillation.
Our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP.
arXiv Detail & Related papers (2022-04-01T07:27:19Z) - Decoupling Object Detection from Human-Object Interaction Recognition [37.133695677465376]
DEFR is a DEtection-FRee method that recognizes Human-Object Interactions (HOI) at the image level without using object locations or human pose.
We present two findings that boost the performance of the detection-free approach, which significantly outperforms detection-assisted state-of-the-art methods.
arXiv Detail & Related papers (2021-12-13T03:01:49Z) - Egocentric Hand-object Interaction Detection and Application [24.68535915849555]
We present a method to detect the hand-object interaction from an egocentric perspective.
We train networks predicting hand pose, hand mask and in-hand object mask to jointly predict the hand-object interaction status.
Our method runs at over 30 FPS, which is much more efficient than Shan's method (1-2 FPS).
arXiv Detail & Related papers (2021-09-29T21:47:16Z) - Towards Good Practices for Efficiently Annotating Large-Scale Image
Classification Datasets [90.61266099147053]
We investigate efficient annotation strategies for collecting multi-class classification labels for a large collection of images.
We propose modifications and best practices aimed at minimizing human labeling effort.
Simulated experiments on a 125k image subset of the ImageNet100 show that it can be annotated to 80% top-1 accuracy with 0.35 annotations per image on average.
arXiv Detail & Related papers (2021-04-26T16:29:32Z) - DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets, respectively.
arXiv Detail & Related papers (2020-10-02T13:59:05Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with
Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z) - G2MF-WA: Geometric Multi-Model Fitting with Weakly Annotated Data [15.499276649167975]
In weak annotation (WA), most of the manual annotations are supposed to be correct yet are inevitably mixed with incorrect ones.
We propose a novel method to make full use of the WA data to boost the multi-model fitting performance.
arXiv Detail & Related papers (2020-01-20T04:22:01Z)