Temporal-Relational CrossTransformers for Few-Shot Action Recognition
- URL: http://arxiv.org/abs/2101.06184v2
- Date: Thu, 18 Mar 2021 15:02:00 GMT
- Title: Temporal-Relational CrossTransformers for Few-Shot Action Recognition
- Authors: Toby Perrett and Alessandro Masullo and Tilo Burghardt and Majid
Mirmehdi and Dima Damen
- Abstract summary: We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set.
Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos.
A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order CrossTransformers.
- Score: 82.0033565755246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel approach to few-shot action recognition, finding
temporally-corresponding frame tuples between the query and videos in the
support set. Distinct from previous few-shot works, we construct class
prototypes using the CrossTransformer attention mechanism to observe relevant
sub-sequences of all support videos, rather than using class averages or single
best matches. Video representations are formed from ordered tuples of varying
numbers of frames, which allows sub-sequences of actions at different speeds
and temporal offsets to be compared.
Our proposed Temporal-Relational CrossTransformers (TRX) achieve
state-of-the-art results on few-shot splits of Kinetics, Something-Something V2
(SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on
SSv2 by a wide margin (12%) due to its ability to model temporal relations.
A detailed ablation showcases the importance of matching to multiple support
set videos and learning higher-order relational CrossTransformers.
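As a concrete illustration of the two ideas above, the following is a minimal PyTorch sketch of ordered frame-tuple representations and of a query-specific class prototype formed by attending over the tuples of all support videos. The tensor shapes, the single tuple cardinality (pairs only), and the bare projection matrices W_qk and W_v are simplifying assumptions for exposition, not the paper's exact CrossTransformer (which adds normalisation and multiple tuple cardinalities).

```python
# Minimal sketch of TRX-style tuple matching (simplified; see caveats above).
import itertools

import torch
import torch.nn.functional as F


def frame_tuples(feats: torch.Tensor, card: int = 2) -> torch.Tensor:
    """Ordered frame-tuple representations for one video.

    feats: (T, D) per-frame features from a backbone.
    Returns (N, card*D): one row per ordered tuple (frame indices strictly
    increasing), so sub-sequences at different speeds and temporal offsets
    become directly comparable units.
    """
    combos = itertools.combinations(range(feats.size(0)), card)
    return torch.stack([feats[list(c)].flatten() for c in combos])


def class_distance(q_tuples, s_tuples, W_qk, W_v):
    """Distance from a query video to one class of the support set.

    q_tuples: (Nq, Dt) tuples of the query video.
    s_tuples: (K*Ns, Dt) tuples pooled over all K support videos of the
              class, so attention can select relevant sub-sequences from any
              of them rather than a class average or a single best match.
    """
    scores = (q_tuples @ W_qk) @ (s_tuples @ W_qk).t() / W_qk.size(1) ** 0.5
    attn = F.softmax(scores, dim=-1)   # (Nq, K*Ns)
    protos = attn @ (s_tuples @ W_v)   # one query-specific prototype per tuple
    return (q_tuples @ W_v - protos).norm(dim=-1).mean()


# Toy episode for one class: K=3 support videos, T=8 frames, D=64 features.
torch.manual_seed(0)
D, d = 64, 32
W_qk, W_v = torch.randn(2 * D, d), torch.randn(2 * D, d)
query = frame_tuples(torch.randn(8, D))
support = torch.cat([frame_tuples(torch.randn(8, D)) for _ in range(3)])
print(class_distance(query, support, W_qk, W_v))  # lower = closer to class
```

At episode level, this distance would be computed for every class in the support set and, per the paper, combined over multiple tuple cardinalities (e.g. pairs and triples) before assigning the query to the nearest class.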
Related papers
- Task-Specific Alignment and Multiple Level Transformer for Few-Shot
Action Recognition [11.700737340560796]
In recent years, some works have applied Transformers to frame features to obtain attention features and enhanced prototypes, with competitive results.
We address these problems through an end-to-end method named "Task-Specific Alignment and Multiple-level Transformer Network (TSA-MLT)".
Our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks.
arXiv Detail & Related papers (2023-07-05T02:13:25Z) - HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot
Action Recognition [51.2715005161475]
We propose a novel Hybrid Relation guided temporal Set Matching approach for few-shot action recognition.
The core idea of HyRSM++ is to integrate all videos within the task to learn discriminative representations.
We show that our method achieves state-of-the-art performance under various few-shot settings.
arXiv Detail & Related papers (2023-01-09T13:32:50Z) - Alignment-guided Temporal Attention for Video Action Recognition [18.5171795689609]
We show that frame-by-frame alignments have the potential to increase the mutual information between frame representations.
We propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames.
arXiv Detail & Related papers (2022-09-30T23:10:47Z) - Inductive and Transductive Few-Shot Video Classification via Appearance
and Temporal Alignments [17.673345523918947]
We present a novel method for few-shot video classification, which performs appearance and temporal alignments.
Our approach achieves results similar to or better than those of previous methods on both benchmark datasets.
arXiv Detail & Related papers (2022-07-21T23:28:52Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate features from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - VRT: A Video Restoration Transformer [126.79589717404863]
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames.
We propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities.
arXiv Detail & Related papers (2022-01-28T17:54:43Z) - TTAN: Two-Stage Temporal Alignment Network for Few-shot Action
Recognition [29.95184808021684]
Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support).
We devise a novel multi-shot fusion strategy, which takes the misalignment among support samples into consideration.
Experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
arXiv Detail & Related papers (2021-07-10T07:22:49Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z) - All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced
Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z)