Cross-modal Representation Learning for Zero-shot Action Recognition
- URL: http://arxiv.org/abs/2205.01657v1
- Date: Tue, 3 May 2022 17:39:27 GMT
- Title: Cross-modal Representation Learning for Zero-shot Action Recognition
- Authors: Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu
- Abstract summary: We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
- Score: 67.57406812235767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a cross-modal Transformer-based framework, which jointly encodes
video data and text labels for zero-shot action recognition (ZSAR). Our model
employs a conceptually new pipeline by which visual representations are learned
in conjunction with visual-semantic associations in an end-to-end manner. The
model design provides a natural mechanism for visual and semantic
representations to be learned in a shared knowledge space, whereby it
encourages the learned visual embedding to be discriminative and more
semantically consistent. In zero-shot inference, we devise a simple semantic
transfer scheme that embeds semantic relatedness information between seen and
unseen classes to composite unseen visual prototypes. Accordingly, the
discriminative features in the visual structure could be preserved and
exploited to alleviate the typical zero-shot issues of information loss,
semantic gap, and the hubness problem. Under a rigorous zero-shot setting of
not pre-training on additional datasets, the experimental results show our model
considerably improves upon the state of the art in ZSAR, reaching encouraging
top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets. Code will
be made available.
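As a rough illustration of the kind of semantic transfer scheme described in the abstract, the minimal sketch below composites unseen-class visual prototypes as relatedness-weighted combinations of seen-class prototypes and then classifies videos by nearest prototype. The softmax-scaled cosine weights, the class-mean prototypes, and every function name here are illustrative assumptions, not the paper's exact formulation.

# A minimal sketch, NOT the paper's implementation. Assumptions: seen-class
# visual prototypes are class-mean video embeddings, relatedness weights are
# softmax-scaled cosine similarities between label text embeddings, and all
# names below are hypothetical.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def composite_unseen_prototypes(seen_protos, seen_label_emb, unseen_label_emb,
                                temperature=0.1):
    """Build unseen visual prototypes as relatedness-weighted sums of seen ones.

    seen_protos:      (S, Dv) visual prototypes of the S seen classes
    seen_label_emb:   (S, Dt) text embeddings of the seen class labels
    unseen_label_emb: (U, Dt) text embeddings of the unseen class labels
    """
    # Cosine relatedness between every unseen label and every seen label.
    sim = l2_normalize(unseen_label_emb) @ l2_normalize(seen_label_emb).T  # (U, S)
    # Softmax over the seen classes turns relatedness into mixing weights.
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return l2_normalize(w @ seen_protos)  # (U, Dv)

def zero_shot_predict(video_emb, unseen_protos):
    """Assign each video to the nearest composited prototype (cosine similarity)."""
    scores = l2_normalize(video_emb) @ unseen_protos.T  # (N, U)
    return scores.argmax(axis=1)

# Toy usage: random vectors stand in for learned cross-modal embeddings.
rng = np.random.default_rng(0)
seen_protos = rng.normal(size=(51, 512))    # e.g. 51 seen classes
seen_labels = rng.normal(size=(51, 512))
unseen_labels = rng.normal(size=(50, 512))  # e.g. 50 unseen classes
protos = composite_unseen_prototypes(seen_protos, seen_labels, unseen_labels)
print(zero_shot_predict(rng.normal(size=(4, 512)), protos))

In the actual model, the visual and label embeddings would come from the jointly trained cross-modal Transformer; random vectors are used above only to keep the sketch self-contained.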
Related papers
- RevCD -- Reversed Conditional Diffusion for Generalized Zero-Shot Learning [0.6792605600335813]
In computer vision, knowledge from seen categories is transferred to unseen categories by exploiting the relationships between visual features and available semantic information.
We present a reversed conditional Diffusion-based model (RevCD) that mitigates this issue by generating semantic features from visual inputs.
Our RevCD model consists of a cross Hadamard-Addition embedding of a sinusoidal time schedule and a multi-headed visual transformer for attention-guided embeddings.
arXiv Detail & Related papers (2024-08-31T17:37:26Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning the visual and semantic spatial distributions of sequences.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of a visual-language model to improve zero-shot HOI detection.
We develop an effective additive self-attention mechanism to generate more comprehensive visual representations.
Our model outperforms previous methods in various zero-shot and fully supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
- Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attributes and objects).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Transductive Zero-Shot Learning by Decoupled Feature Generation [30.664199050468472]
We focus on the transductive setting, in which unlabelled visual data from unseen classes is available.
We propose to decouple the tasks of generating realistic visual features and translating semantic attributes into visual cues.
We present a detailed ablation study to dissect the effect of our proposed decoupling approach, while demonstrating its superiority over the related state of the art.
arXiv Detail & Related papers (2021-02-05T16:17:52Z)
- Two-Level Adversarial Visual-Semantic Coupling for Generalized Zero-shot Learning [21.89909688056478]
We propose a new two-level joint idea to augment the generative network with an inference network during training.
This provides strong cross-modal interaction for effective transfer of knowledge between visual and semantic domains.
We evaluate our approach on four benchmark datasets against several state-of-the-art methods, and show its performance.
arXiv Detail & Related papers (2020-07-15T15:34:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.