TTAN: Two-Stage Temporal Alignment Network for Few-shot Action
Recognition
- URL: http://arxiv.org/abs/2107.04782v1
- Date: Sat, 10 Jul 2021 07:22:49 GMT
- Title: TTAN: Two-Stage Temporal Alignment Network for Few-shot Action
Recognition
- Authors: Shuyuan Li, Huabin Liu, Rui Qian, Yuxi Li, John See, Mengjuan Fei,
Xiaoyuan Yu, Weiyao Lin
- Abstract summary: Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support).
We devise a novel multi-shot fusion strategy, which takes the misalignment among support samples into consideration.
Experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
- Score: 29.95184808021684
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Few-shot action recognition aims to recognize novel action classes (query)
using just a few samples (support). The majority of current approaches follow
the metric learning paradigm, which learns to compare the similarity between
videos. Recently, it has been observed that directly measuring this similarity
is not ideal since different action instances may show distinctive temporal
distribution, resulting in severe misalignment issues across query and support
videos. In this paper, we tackle this problem from two distinct aspects --
action duration misalignment and motion evolution misalignment. We address them
sequentially through a Two-stage Temporal Alignment Network (TTAN). The first
stage performs temporal transformation with the predicted affine warp
parameters, while the second stage utilizes a cross-attention mechanism to
coordinate the features of the support and query to a consistent evolution.
Besides, we devise a novel multi-shot fusion strategy, which takes the
misalignment among support samples into consideration. Ablation studies and
visualizations demonstrate the role played by both stages in addressing the
misalignment. Extensive experiments on benchmark datasets show the potential of
the proposed method in achieving state-of-the-art performance for few-shot
action recognition.
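To make the two-stage design concrete, below is a minimal PyTorch sketch assuming per-frame features of shape (T, D); the single scale-and-shift warp, the mean-pooled descriptors, and all module names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageTemporalAlignment(nn.Module):
    """Sketch: stage 1 warps the support clip in time with predicted
    affine parameters; stage 2 aligns motion evolution via cross-attention."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        # Stage 1: predict an affine warp (scale a, shift b) from the
        # concatenated clip-level descriptors of support and query.
        self.warp_predictor = nn.Linear(2 * dim, 2)
        # Stage 2: query frames attend over the warped support frames.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def temporal_warp(self, feats, a, b):
        # Resample (T, D) features at time positions a*t + b using 1-D
        # linear interpolation (grid_sample over a height-1 "image").
        T, D = feats.shape
        t = torch.linspace(-1.0, 1.0, T, device=feats.device)
        grid_x = (a * t + b).view(1, 1, T, 1)
        grid = torch.cat([grid_x, torch.zeros_like(grid_x)], dim=-1)
        x = feats.t().reshape(1, D, 1, T)           # time as image width
        warped = F.grid_sample(x, grid, mode='bilinear', align_corners=True)
        return warped.view(D, T).t()                # back to (T, D)

    def forward(self, support, query):              # both (T, D)
        desc = torch.cat([support.mean(0), query.mean(0)])
        a, b = self.warp_predictor(desc)             # two scalars
        support = self.temporal_warp(support, 1.0 + a, b)    # stage 1
        aligned, _ = self.cross_attn(query.unsqueeze(0),     # stage 2
                                     support.unsqueeze(0),
                                     support.unsqueeze(0))
        return aligned.squeeze(0)                    # query-aligned support
```

In a multi-shot episode, the paper's fusion strategy would combine several such warped supports before comparison, taking their mutual misalignment into account.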
Related papers
- Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling [51.38330727868982]
Bidirectional Decoding (BID) is a test-time inference algorithm that bridges action chunking with closed-loop operations.
We show that BID boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
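The snippet below is a rough sketch of closed-loop resampling over overlapping action chunks; the agreement-with-previous-chunk selection rule and all names are illustrative assumptions, not BID's exact criteria.

```python
import numpy as np

def resample_chunk(policy_sample, prev_chunk, overlap, num_candidates=8):
    """Sketch: draw several candidate action chunks and commit to the one
    whose first `overlap` actions best agree with the tail of the chunk
    committed at the previous decision step.

    policy_sample: callable returning a (horizon, action_dim) chunk.
    prev_chunk:    previously committed chunk, or None at the first step.
    """
    candidates = [policy_sample() for _ in range(num_candidates)]
    if prev_chunk is None:
        return candidates[0]
    prev_tail = prev_chunk[-overlap:]        # actions covering the shared steps
    scores = [np.linalg.norm(c[:overlap] - prev_tail) for c in candidates]
    return candidates[int(np.argmin(scores))]
```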
arXiv Detail & Related papers (2024-08-30T15:39:34Z)
- Skeleton-based Action Recognition through Contrasting Two-Stream Spatial-Temporal Networks [11.66009967197084]
We propose a novel Contrastive GCN-Transformer Network (ConGT), which fuses the spatial and temporal modules in parallel.
We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.
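A minimal sketch of the parallel fusion idea, assuming skeleton features of shape (B, T, J, D), a one-step graph convolution for the spatial stream, and summation as the fusion operator (all simplifications, not the actual ConGT design):

```python
import torch
import torch.nn as nn

class ParallelSpatialTemporal(nn.Module):
    """Sketch: a spatial graph-convolution stream and a temporal Transformer
    stream run in parallel over skeleton features and are fused by addition."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.spatial_fc = nn.Linear(dim, dim)      # GCN transform
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x, adj):
        # x: (B, T, J, D) per-joint features; adj: (J, J) joint adjacency.
        B, T, J, D = x.shape
        # Spatial stream: aggregate neighbouring joints within each frame.
        spatial = torch.einsum('ij,btjd->btid', adj, self.spatial_fc(x))
        # Temporal stream: self-attention over time, independently per joint.
        temporal = self.temporal(x.permute(0, 2, 1, 3).reshape(B * J, T, D))
        temporal = temporal.reshape(B, J, T, D).permute(0, 2, 1, 3)
        return spatial + temporal                  # parallel fusion
```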
arXiv Detail & Related papers (2023-01-27T02:12:08Z)
- Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
Motivated by optical flow, this paper proposes bilateral motion-oriented features.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
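As a rough illustration, bilateral motion descriptors at multiple temporal scales could be computed as below; frame differencing as an optical-flow surrogate and the chosen strides are assumptions, not the paper's exact operators.

```python
import torch

def multiscale_motion_features(frames, scales=(1, 2, 4)):
    """Sketch: bilateral (forward and backward) frame differences computed
    at several temporal strides, one descriptor set per scale.

    frames: (T, D) per-frame features; returns a list of (T'-2, 2*D) tensors.
    """
    outputs = []
    for s in scales:
        sub = frames[::s]                  # coarser temporal resolution
        forward = sub[2:] - sub[1:-1]      # motion toward the next frame
        backward = sub[1:-1] - sub[:-2]    # motion from the previous frame
        outputs.append(torch.cat([forward, backward], dim=-1))
    return outputs
```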
arXiv Detail & Related papers (2022-09-26T01:36:22Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
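A minimal sketch of how an LCS score can be made differentiable with a smoothed dynamic program, given a pairwise frame-similarity matrix; the log-sum-exp relaxation is a standard trick and not necessarily the paper's exact formulation.

```python
import torch

def soft_lcs(sim, tau=0.1):
    """Sketch: differentiable longest-common-subsequence score from a
    pairwise similarity matrix sim (n, m) between two frame sequences.

    Hard recursion: L[i,j] = max(L[i-1,j], L[i,j-1], L[i-1,j-1] + sim[i,j]);
    the max is replaced by a temperature-smoothed log-sum-exp so gradients
    flow back into sim.
    """
    n, m = sim.shape
    zero = sim.new_zeros(())
    prev = [zero] * (m + 1)                  # DP row i-1
    for i in range(n):
        curr = [zero]                        # DP row i, column 0
        for j in range(m):
            options = torch.stack([prev[j + 1],           # skip frame in seq 1
                                   curr[j],               # skip frame in seq 2
                                   prev[j] + sim[i, j]])  # match the two frames
            curr.append(tau * torch.logsumexp(options / tau, dim=0))
        prev = curr
    return prev[m]
```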
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to consecutively regulate the intermediate representation, producing a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Representation Learning via Global Temporal Alignment and Cycle-Consistency [20.715813546383178]
We introduce a weakly supervised method for representation learning based on aligning temporal sequences.
We report significant performance increases over previous methods.
In addition, we report two applications of our temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
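A minimal sketch of a soft nearest-neighbour cycle-consistency loss of the kind used in temporal alignment work; the exact loss below is an assumption, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(u, v, tau=0.1):
    """Sketch: map each frame of u to v by soft nearest neighbour, map the
    result back to u, and penalise cycles that do not return to their start.

    u: (n, d) and v: (m, d) per-frame embeddings.
    """
    soft_nn = F.softmax(u @ v.t() / tau, dim=1) @ v     # (n, d) images in v
    back = F.softmax(soft_nn @ u.t() / tau, dim=1)      # (n, n) return distribution
    positions = torch.arange(u.shape[0], dtype=u.dtype, device=u.device)
    landing = back @ positions                          # expected return index
    return F.mse_loss(landing, positions)
```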
arXiv Detail & Related papers (2021-05-11T17:34:04Z)
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
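A sketch of what a prototype-centered contrastive loss could look like, with the softmax taken over queries rather than over classes; the names and exact normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_centered_loss(prototypes, queries, query_labels, tau=0.1):
    """Sketch: classify each class prototype against the pool of query
    embeddings, complementing the usual query-centered objective.

    prototypes:   (C, d) one embedding per class.
    queries:      (Q, d) query-sample embeddings.
    query_labels: (Q,)   class index of each query sample.
    """
    sim = prototypes @ queries.t() / tau        # (C, Q)
    log_p = F.log_softmax(sim, dim=1)           # softmax over queries
    classes = torch.arange(prototypes.shape[0], device=query_labels.device)
    mask = (query_labels.unsqueeze(0) == classes.unsqueeze(1)).float()  # (C, Q)
    # Average log-probability of each prototype's positive queries.
    loss = -(log_p * mask).sum(1) / mask.sum(1).clamp(min=1)
    return loss.mean()
```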
arXiv Detail & Related papers (2021-01-15T15:47:35Z)
- Temporal-Relational CrossTransformers for Few-Shot Action Recognition [82.0033565755246]
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frames between the query and videos in the support set.
Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos.
A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order CrossTransformers.
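A simplified sketch of query-conditioned prototypes via cross-attention; it attends over single frames, whereas the paper's CrossTransformer attends over ordered frame tuples.

```python
import torch
import torch.nn.functional as F

def query_conditioned_prototype(query_frames, support_frames, tau=1.0):
    """Sketch: build a class prototype that depends on the query by letting
    each query frame attend over all frames of all support videos of a class.

    query_frames:   (Tq, d)
    support_frames: (K, Ts, d) for the K support videos of one class.
    Returns a (Tq, d) query-aligned class prototype.
    """
    support = support_frames.reshape(-1, support_frames.shape[-1])  # (K*Ts, d)
    attn = F.softmax(query_frames @ support.t() / tau, dim=1)       # (Tq, K*Ts)
    return attn @ support
```

The distance between the query frames and this prototype then scores the class; because the attention pools over every support video at once, matching naturally spans the whole support set.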
arXiv Detail & Related papers (2021-01-15T15:47:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.