Few-shot Action Recognition with Permutation-invariant Attention
- URL: http://arxiv.org/abs/2001.03905v3
- Date: Tue, 4 Aug 2020 02:44:04 GMT
- Title: Few-shot Action Recognition with Permutation-invariant Attention
- Authors: Hongguang Zhang, Li Zhang, Xiaojuan Qi, Hongdong Li, Philip H. S.
Torr, Piotr Koniusz
- Abstract summary: We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns.
We exploit spatial and temporal attention modules and self-supervision to re-weight block contributions during pooling.
Our method outperforms the state of the art on the HMDB51, UCF101 and miniMIT datasets.
- Score: 169.61294360056925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many few-shot learning models focus on recognising images. In contrast, we
tackle a challenging task of few-shot action recognition from videos. We build
on a C3D encoder for spatio-temporal video blocks to capture short-range action
patterns. Such encoded blocks are aggregated by permutation-invariant pooling
to make our approach robust to varying action lengths and long-range temporal
dependencies whose patterns are unlikely to repeat even in clips of the same
class. Subsequently, the pooled representations are combined into simple
relation descriptors which encode so-called query and support clips. Finally,
relation descriptors are fed to the comparator with the goal of similarity
learning between query and support clips. Importantly, to re-weight block
contributions during pooling, we exploit spatial and temporal attention modules
and self-supervision. In naturalistic clips (of the same class) there exists a
temporal distribution shift--the locations of discriminative temporal action
hotspots vary. Thus, we permute the blocks of a clip and align the resulting
attention regions with the identically permuted attention regions of the
non-permuted clip, training the attention mechanism to be invariant to block
(and thus long-term hotspot) permutations. Our method outperforms the state of
the art on the HMDB51, UCF101 and miniMIT datasets.
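As a rough illustration of the pipeline described above (block encoding, attention-re-weighted permutation-invariant pooling, relation descriptors fed to a comparator, and permutation-alignment self-supervision for the attention), here is a minimal PyTorch-style sketch. It assumes pre-extracted C3D-like block features; all module names, layer sizes, and the concatenation-based relation descriptor are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: attention-weighted permutation-invariant pooling,
# relation descriptor + comparator, and a permutation-alignment loss for the
# temporal attention. Names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Scores each spatio-temporal block; used to re-weight pooling."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, blocks):                                # blocks: (B, N, D)
        return torch.softmax(self.score(blocks).squeeze(-1), dim=1)  # (B, N)


class FewShotVideoComparator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.attn = TemporalAttention(dim)
        # Comparator over a simple relation descriptor (concatenation here).
        self.comparator = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, 1))

    def pool(self, blocks):
        """Attention-re-weighted sum over blocks (independent of block order)."""
        w = self.attn(blocks)                                 # (B, N)
        pooled = (w.unsqueeze(-1) * blocks).sum(dim=1)        # (B, D)
        return pooled, w

    def forward(self, query_blocks, support_blocks):
        q, _ = self.pool(query_blocks)
        s, _ = self.pool(support_blocks)
        rel = torch.cat([q, s], dim=-1)                       # relation descriptor
        return self.comparator(rel).squeeze(-1)               # similarity score

    def permutation_alignment_loss(self, blocks):
        """Self-supervision: attention of a permuted clip should match the
        identically permuted attention of the original clip."""
        perm = torch.randperm(blocks.size(1), device=blocks.device)
        _, w_orig = self.pool(blocks)
        _, w_perm = self.pool(blocks[:, perm])
        return nn.functional.mse_loss(w_perm, w_orig[:, perm])


# Usage with dummy C3D-like block features (2 clips, 8 blocks, 512-dim each).
model = FewShotVideoComparator(dim=512)
query, support = torch.randn(2, 8, 512), torch.randn(2, 8, 512)
score = model(query, support)
aux_loss = model.permutation_alignment_loss(query)
```

The weighted-sum pooling keeps the clip representation independent of block order, and the alignment loss ties the attention of a permuted clip to the identically permuted attention of the original clip, which is the invariance argued for in the abstract.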
Related papers
- FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement.
This method can accurately identify the start and end boundaries of actions in the query video.
Our model achieves competitive performance in experiments on the benchmark datasets ActivityNet1.3 and THUMOS14.
arXiv Detail & Related papers (2024-08-25T08:17:25Z) - TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z) - Locality-Aware Inter-and Intra-Video Reconstruction for Self-Supervised
Correspondence Learning [74.03651142051656]
We develop LIIR, a locality-aware inter- and intra-video reconstruction framework.
We exploit cross-video affinities as extra negative samples within a unified, inter- and intra-video reconstruction scheme.
arXiv Detail & Related papers (2022-03-27T15:46:42Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We also study how to better distinguish between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Modelling Neighbor Relation in Joint Space-Time Graph for Video
Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z) - Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z) - Co-Saliency Spatio-Temporal Interaction Network for Person
Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features, as well as their spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z) - Multi-Granularity Reference-Aided Attentive Feature Aggregation for
Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z) - STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)