Learning Using Privileged Information for Zero-Shot Action Recognition
- URL: http://arxiv.org/abs/2206.08632v1
- Date: Fri, 17 Jun 2022 08:46:09 GMT
- Title: Learning Using Privileged Information for Zero-Shot Action Recognition
- Authors: Zhiyi Gao, Wanqing Li, Zihui Guo, Bin Yu and Yonghong Hou
- Abstract summary: This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap.
Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
- Score: 15.9032110752123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have
never been seen during training. Most existing methods assume a shared semantic
space between seen and unseen actions and intend to directly learn a mapping
from a visual space to the semantic space. This approach has been challenged by
the semantic gap between the visual space and semantic space. This paper
presents a novel method that uses object semantics as privileged information to
narrow the semantic gap and, hence, effectively assist the learning. In
particular, a simple hallucination network is proposed to implicitly extract
object semantics during testing without explicitly extracting objects, and a
cross-attention module is developed to augment visual features with the object
semantics. Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have
shown that the proposed method outperforms the state-of-the-art methods by a
large margin.
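As a concrete illustration of the mechanism described in the abstract, the minimal PyTorch sketch below hallucinates an object-semantic vector from a pooled video feature and fuses it back via cross-attention. The module names, dimensionalities and the use of multi-head attention are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class HallucinationNet(nn.Module):
    """Illustrative stand-in: predict an object-semantic vector from video
    features, so no explicit object detector is needed at test time."""
    def __init__(self, vis_dim=2048, obj_dim=300):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(), nn.Linear(512, obj_dim))

    def forward(self, vis_feat):            # (B, vis_dim)
        return self.mlp(vis_feat)           # (B, obj_dim) hallucinated object semantics

class CrossAttentionAugment(nn.Module):
    """Augment visual features by letting them attend to the object semantics."""
    def __init__(self, vis_dim=2048, obj_dim=300, dim=512, heads=4):
        super().__init__()
        self.q = nn.Linear(vis_dim, dim)
        self.kv = nn.Linear(obj_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, vis_dim)

    def forward(self, vis_feat, obj_sem):   # (B, vis_dim), (B, obj_dim)
        q = self.q(vis_feat).unsqueeze(1)   # (B, 1, dim) query from the visual feature
        kv = self.kv(obj_sem).unsqueeze(1)  # (B, 1, dim) key/value from object semantics
        ctx, _ = self.attn(q, kv, kv)       # cross-attention
        return vis_feat + self.out(ctx.squeeze(1))  # residual augmentation

vis = torch.randn(8, 2048)                  # pooled video features
aug = CrossAttentionAugment()(vis, HallucinationNet()(vis))
print(aug.shape)                            # torch.Size([8, 2048])
```

The augmented feature would then be projected into the action-semantic space for zero-shot matching, as in other mapping-based ZSAR methods.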
Related papers
- Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, we fuse visual tokens with low semantic-visual correspondence to discard semantically unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language models to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- VILLS -- Video-Image Learning to Learn Semantics for Person Re-Identification [51.89551385538251]
We propose VILLS (Video-Image Learning to Learn Semantics), a self-supervised method that jointly learns spatial and temporal features from images and videos.
VILLS first designs a local semantic extraction module that adaptively extracts semantically consistent and robust spatial features.
Then, VILLS designs a unified feature learning and adaptation module to represent image and video modalities in a consistent feature space.
arXiv Detail & Related papers (2023-11-27T19:30:30Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
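The entry above is closest in spirit to this paper's task. A generic joint video-text embedding head for ZSAR, not the cited authors' architecture, can be sketched as below; all dimensions and names are chosen for illustration. At training time the similarity logits over seen classes would be optimized with a cross-entropy or contrastive loss; at test time unseen labels are recognized by the nearest label embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpaceZSAR(nn.Module):
    """Generic cross-modal head: project video and label-text features into a
    shared space and score actions by cosine similarity (illustrative only)."""
    def __init__(self, vid_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, vid_feat, label_emb):
        # vid_feat: (B, vid_dim) pooled video features
        # label_emb: (C, txt_dim) text embeddings of C action labels (seen or unseen)
        v = F.normalize(self.vid_proj(vid_feat), dim=-1)
        t = F.normalize(self.txt_proj(label_emb), dim=-1)
        return v @ t.t()                    # (B, C) cosine-similarity logits

logits = JointSpaceZSAR()(torch.randn(4, 2048), torch.randn(10, 768))
pred = logits.argmax(dim=-1)                # zero-shot prediction: most similar label
```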
- Learning Semantics for Visual Place Recognition through Multi-Scale Attention [14.738954189759156]
We present the first VPR algorithm that learns robust global embeddings from both visual appearance and semantic content of the data.
Experiments on various scenarios validate this new approach and demonstrate its performance against state-of-the-art methods.
arXiv Detail & Related papers (2022-01-24T14:13:12Z)
- Tell me what you see: A zero-shot action recognition method based on natural language descriptions [3.136605193634262]
We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets.
arXiv Detail & Related papers (2021-12-18T17:44:07Z)
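For the description-based entry above, the inference step can be illustrated with off-the-shelf sentence embeddings: a caption generated for the video and short descriptions of the candidate classes are embedded with a paraphrase-tuned BERT encoder and matched by cosine similarity. The sentence-transformers library, the model name and the example sentences are assumptions for illustration, not the cited paper's exact setup.

```python
# Minimal sketch of zero-shot matching in a shared sentence-embedding space.
# Library, model name and example sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Caption produced by any off-the-shelf video-captioning model (placeholder text).
video_description = "a person swings a golf club on a grass field"

# Descriptive sentences for the candidate (unseen) action classes.
class_descriptions = {
    "golf_swing": "someone hits a golf ball with a club",
    "archery":    "someone draws a bow and shoots an arrow at a target",
    "juggling":   "someone keeps several balls in the air with their hands",
}

query = encoder.encode(video_description, convert_to_tensor=True)
labels = encoder.encode(list(class_descriptions.values()), convert_to_tensor=True)

scores = util.cos_sim(query, labels)[0]     # cosine similarity to each class description
print(list(class_descriptions)[int(scores.argmax())])  # expected: "golf_swing"
```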
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, clearly demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)