Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding
- URL: http://arxiv.org/abs/2203.05156v1
- Date: Thu, 10 Mar 2022 05:03:58 GMT
- Title: Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding
- Authors: Keval Doshi and Yasin Yilmaz
- Abstract summary: We take a new comprehensive look at the inductive zero-shot action recognition problem from a realistic standpoint.
Specifically, we advocate for a concrete formulation for zero-shot action recognition that avoids an exact overlap between the training and testing classes.
We propose a novel end-to-end trained transformer model which is capable of capturing long range spatiotemporal dependencies efficiently.
- Score: 36.24563211765782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While video action recognition has been an active area of research for
several years, zero-shot action recognition has only recently started gaining
traction. However, there is a lack of a formal definition for the zero-shot
learning paradigm leading to uncertainty about classes that can be considered
as previously unseen. In this work, we take a new comprehensive look at the
inductive zero-shot action recognition problem from a realistic standpoint.
Specifically, we advocate for a concrete formulation for zero-shot action
recognition that avoids an exact overlap between the training and testing
classes and also limits the intra-class variance; and propose a novel
end-to-end trained transformer model which is capable of capturing long range
spatiotemporal dependencies efficiently, contrary to existing approaches which
use 3D-CNNs. The proposed approach outperforms the existing state-of-the-art
algorithms in many settings on all benchmark datasets by a wide margin.
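As a rough illustration of the inductive zero-shot pipeline the abstract describes, the sketch below pools frame features with a transformer encoder, projects them into a word-embedding space, and labels unseen actions by the nearest class-name embedding. The module names, dimensions, and mean-pooling readout are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch only: a generic transformer-based video-to-semantic embedding,
# not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoSemanticEmbedder(nn.Module):
    """Maps a clip's frame features into a word-embedding (semantic) space."""
    def __init__(self, feat_dim=768, sem_dim=300, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(feat_dim, sem_dim)  # project into semantic space

    def forward(self, frame_feats):              # (B, T, feat_dim)
        h = self.temporal_encoder(frame_feats)   # long-range temporal attention
        return self.proj(h.mean(dim=1))          # (B, sem_dim)

def zero_shot_predict(model, frame_feats, unseen_class_embs):
    """Assign each clip to the nearest unseen class-name embedding."""
    v = F.normalize(model(frame_feats), dim=-1)  # (B, sem_dim)
    c = F.normalize(unseen_class_embs, dim=-1)   # (K, sem_dim)
    return (v @ c.T).argmax(dim=-1)              # predicted class indices
```

Training on seen classes would align clip embeddings with their class-name vectors (e.g., word2vec), so unseen classes become recognizable purely through the shared semantic space.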
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between attention maps from two different views of the action videos.
Our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets.
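A minimal sketch of an attention-consistency objective between two views; plain cosine agreement stands in here for the paper's directed Gromov-Wasserstein discrepancy, purely to illustrate the idea.

```python
import torch
import torch.nn.functional as F

def attention_consistency_loss(attn_view1, attn_view2):
    """Encourage two views of the same action to attend similarly.
    attn_view*: (B, T, T) self-attention maps from each view's encoder.
    NOTE: cosine agreement is a stand-in for the paper's directed
    Gromov-Wasserstein discrepancy.
    """
    a = F.normalize(attn_view1.flatten(1), dim=-1)
    b = F.normalize(attn_view2.flatten(1), dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```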
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Open Set Action Recognition via Multi-Label Evidential Learning [25.15753429188536]
We propose a new method for open set action recognition and novelty detection via MUlti-Label Evidential learning (MULE).
Our Beta Evidential Neural Network estimates multi-action uncertainty with Beta densities based on actor-context-object relation representations.
Our proposed approach achieves promising performance in single/multi-actor, single/multi-action settings.
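The Beta-evidential idea can be sketched as follows: the network predicts non-negative positive/negative evidence per action, which parameterizes a Beta density whose concentration yields a per-action uncertainty. This is a generic subjective-logic-style sketch, not MULE's exact formulation.

```python
import torch
import torch.nn.functional as F

def beta_evidential_head(logits):
    """Per-action Beta parameters from raw two-channel outputs.
    logits: (B, K, 2) raw scores -> non-negative evidence via softplus.
    Returns per-action probability and uncertainty (subjective-logic style;
    a sketch of the idea, not MULE's exact formulation).
    """
    evidence = F.softplus(logits)        # (B, K, 2): e_pos and e_neg
    alpha = evidence[..., 0] + 1.0       # Beta alpha per action
    beta = evidence[..., 1] + 1.0        # Beta beta per action
    prob = alpha / (alpha + beta)        # expected P(action present)
    uncertainty = 2.0 / (alpha + beta)   # high when evidence is scarce
    return prob, uncertainty
```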
arXiv Detail & Related papers (2023-02-27T18:34:18Z)
- Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
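Zero-shot scoring via vision-language prompting can be sketched with off-the-shelf CLIP encoders: class names are wrapped in a text prompt and per-snippet visual features are scored against the prompt embeddings. STALE's actual model adds learnable prompts and a parallel localization stream; the sketch below shows only the core idea.

```python
import torch
import clip  # OpenAI CLIP; any vision-language model with text/image encoders works

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def prompt_class_scores(snippet_images, class_names):
    """Score each video snippet against prompted class names.
    snippet_images: (T, 3, 224, 224) preprocessed frames; returns (T, K) scores.
    """
    prompts = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    with torch.no_grad():
        txt = model.encode_text(prompts)                     # (K, D)
        img = model.encode_image(snippet_images.to(device))  # (T, D)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = img / img.norm(dim=-1, keepdim=True)
    return img @ txt.T  # threshold over time to sketch temporal detection
```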
arXiv Detail & Related papers (2022-07-17T13:59:46Z)
- DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition [22.649489578944838]
This work presents a novel end-to-end Transformer-based Directed Attention framework for robust action recognition.
The contributions of this work are three-fold. Firstly, we introduce the issue of ordered temporal learning to the action recognition problem.
Secondly, a new Directed Attention mechanism is introduced to understand and attend to human actions in the correct temporal order.
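A generic way to make self-attention respect temporal order is a directional mask that lets each frame attend only to frames in the intended direction; the snippet below is a plain causal-mask sketch, not DirecFormer's learned directed attention.

```python
import torch
import torch.nn as nn

T, D = 16, 256
frames = torch.randn(2, T, D)  # (batch, time, feature)

# Directional mask: position t may attend only to positions <= t,
# so attention respects the forward temporal order of the action.
directed_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, weights = attn(frames, frames, frames, attn_mask=directed_mask)
```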
arXiv Detail & Related papers (2022-03-19T03:41:48Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We formulate the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- A New Split for Evaluating True Zero-Shot Action Recognition [45.815342448662946]
We propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes.
We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51.
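Constructing such a split amounts to discarding candidate test classes that overlap the training or pre-training vocabulary. A simplified name-matching filter is sketched below; the TruZe protocol additionally screens for semantic overlap (e.g., against Kinetics-400 classes).

```python
def true_zero_shot_split(candidate_classes, pretrain_classes):
    """Keep only classes with no (sub)string overlap with pretraining classes.
    A simplified filter; the TruZe protocol also checks semantic similarity.
    """
    def norm(c):
        return c.lower().replace("_", " ").strip()
    pre = {norm(c) for c in pretrain_classes}
    return [c for c in candidate_classes
            if not any(norm(c) in p or p in norm(c) for p in pre)]

# e.g. filter UCF101 class names against Kinetics-400 pretraining class names
```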
arXiv Detail & Related papers (2021-07-27T18:22:39Z)
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
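The prototype-centered direction can be sketched as follows: class prototypes are support-set means, and each prototype acts as the anchor contrasted over all query embeddings, complementing the usual query-to-prototype objective. This is a minimal sketch that omits PAL's attentive hybrid mechanism.

```python
import torch
import torch.nn.functional as F

def prototype_centered_loss(support, support_labels, queries, query_labels, tau=0.1):
    """support: (N_s, D), queries: (N_q, D); labels are class ids 0..K-1.
    Each prototype is the anchor and is contrasted over all queries."""
    K = int(support_labels.max()) + 1
    protos = torch.stack([support[support_labels == k].mean(0) for k in range(K)])
    sims = F.normalize(protos, dim=-1) @ F.normalize(queries, dim=-1).T / tau  # (K, N_q)
    loss = 0.0
    for k in range(K):
        pos = (query_labels == k)               # queries of class k
        log_p = F.log_softmax(sims[k], dim=-1)  # distribution over queries
        loss = loss - log_p[pos].mean()         # pull class-k queries to proto k
    return loss / K
```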
arXiv Detail & Related papers (2021-01-20T11:48:12Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present an action recognition model that is trained with only video-level labels.
Our method uses per-person detectors that have been trained on large image datasets, within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
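The standard MIL setup referred to here can be sketched with a bag-level objective: per-person instance scores are max-pooled into a video-level prediction trained against the video-level label, so a bag is positive iff its best instance is. The sketch below shows only this baseline assumption, which the paper relaxes.

```python
import torch
import torch.nn.functional as F

def mil_bag_loss(instance_logits, bag_labels):
    """instance_logits: (B, N, K) per-person-detection action scores;
    bag_labels: (B, K) video-level multi-hot labels.
    Standard MIL: a bag is positive iff its best instance is positive."""
    bag_logits = instance_logits.max(dim=1).values  # (B, K) max-pooling
    return F.binary_cross_entropy_with_logits(bag_logits, bag_labels.float())
```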
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.