Polysemy Deciphering Network for Robust Human-Object Interaction
Detection
- URL: http://arxiv.org/abs/2008.02918v3
- Date: Wed, 24 Mar 2021 01:13:06 GMT
- Title: Polysemy Deciphering Network for Robust Human-Object Interaction
Detection
- Authors: Xubin Zhong, Changxing Ding, Xian Qu, Dacheng Tao
- Abstract summary: We propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection.
We refine features for HOI detection to be polysemy-aware through the use of two novel modules.
Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF), which guides PD-Net to make decisions based on feature types deemed more important according to the language priors.
- Score: 86.97181280842098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection is important to human-centric scene
understanding tasks. Existing works tend to assume that the same verb has
similar visual characteristics in different HOI categories, an approach that
ignores the diverse semantic meanings of the verb. To address this issue, in
this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that
decodes the visual polysemy of verbs for HOI detection in three distinct ways.
this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that
decodes the visual polysemy of verbs for HOI detection in three distinct ways.
First, we refine features for HOI detection to be polysemy-aware through the use
of two novel modules: namely, Language Prior-guided Channel Attention (LPCA)
and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important
elements in human and object appearance features for each HOI category to be
identified; moreover, LPFA augments human pose and spatial features for HOI
detection using language priors, enabling the verb classifiers to receive
language hints that reduce intra-class variation for the same verb. Second, we
introduce a novel Polysemy-Aware Modal Fusion module (PAMF), which guides
PD-Net to make decisions based on feature types deemed more important according
to the language priors. Third, we propose to relieve the verb polysemy problem
through sharing verb classifiers for semantically similar HOI categories.
Furthermore, to expedite research on the verb polysemy problem, we build a new
benchmark dataset named HOI-VerbPolysemy (HOIVP), which includes common verbs
(predicates) that have diverse semantic meanings in the real world. Finally,
through deciphering the visual polysemy of verbs, our approach is demonstrated
to outperform state-of-the-art methods by significant margins on the HICO-DET,
V-COCO, and HOI-VP databases. Code and data in this paper are available at
https://github.com/MuchHair/PD-Net.
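
To make the role of the language priors more concrete, the sketch below illustrates one plausible way a channel-attention module in the spirit of LPCA could reweight human or object appearance features using an embedding of the candidate HOI category. The module structure, layer sizes, and choice of embedding are assumptions made for illustration only; they are not taken from the released PD-Net code at the repository above.

# Minimal PyTorch sketch of language-prior-guided channel attention (LPCA-style).
# All dimensions and the use of label word embeddings as priors are assumptions.
import torch
import torch.nn as nn


class LanguagePriorChannelAttention(nn.Module):
    def __init__(self, feat_dim: int, prior_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Project the language prior (e.g., a word embedding of the HOI label)
        # into channel-wise attention weights over the appearance feature.
        self.attn = nn.Sequential(
            nn.Linear(prior_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, appearance_feat: torch.Tensor, language_prior: torch.Tensor) -> torch.Tensor:
        # appearance_feat: (B, feat_dim) human or object appearance feature
        # language_prior:  (B, prior_dim) embedding of the candidate HOI category
        weights = self.attn(language_prior)  # (B, feat_dim), values in [0, 1]
        return appearance_feat * weights     # highlight channels relevant to this category


if __name__ == "__main__":
    # Usage example with random tensors standing in for real features.
    lpca = LanguagePriorChannelAttention(feat_dim=1024, prior_dim=300)
    feat = torch.randn(4, 1024)   # appearance features for 4 human-object pairs
    prior = torch.randn(4, 300)   # e.g., GloVe embeddings of candidate HOI labels
    refined = lpca(feat, prior)
    print(refined.shape)          # torch.Size([4, 1024])

The point of the sketch is that the same appearance feature is reweighted differently for each candidate HOI category, which is one way a network could separate the distinct visual meanings of a polysemous verb.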
Related papers
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
- Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation [3.976851945232775]
Current approaches for sign language recognition rely on RGB video inputs, which are vulnerable to fluctuations in the background.
We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator.
We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology.
arXiv Detail & Related papers (2024-05-09T10:58:37Z)
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- RefCrowd: Grounding the Target in Crowd with Referring Expressions [20.822504213866726]
We propose RefCrowd, which aims at grounding the target person in a crowd with referring expressions.
It requires not only sufficiently mining the natural language information, but also carefully focusing on subtle differences between the target and a crowd of persons with similar appearance.
We also propose a Fine-grained Multi-modal Attribute Contrastive Network (FMAC) to deal with REF in crowd understanding.
arXiv Detail & Related papers (2022-06-16T13:39:26Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.