Spatio-temporal Relation Modeling for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2112.05132v1
- Date: Thu, 9 Dec 2021 18:59:14 GMT
- Title: Spatio-temporal Relation Modeling for Few-shot Action Recognition
- Authors: Anirudh Thatipelli, Sanath Narayan, Salman Khan, Rao Muhammad Anwer,
Fahad Shahbaz Khan, Bernard Ghanem
- Abstract summary: We propose a few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations.
Our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
- Score: 100.3999454780478
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a novel few-shot action recognition framework, STRM, which
enhances class-specific feature discriminability while simultaneously learning
higher-order temporal representations. The focus of our approach is a novel
spatio-temporal enrichment module that aggregates spatial and temporal contexts
with dedicated local patch-level and global frame-level feature enrichment
sub-modules. Local patch-level enrichment captures the appearance-based
characteristics of actions. On the other hand, global frame-level enrichment
explicitly encodes the broad temporal context, thereby capturing the relevant
object features over time. The resulting spatio-temporally enriched
representations are then utilized to learn the relational matching between
query and support action sub-sequences. We further introduce a query-class
similarity classifier on the patch-level enriched features to enhance
class-specific feature discriminability by reinforcing the feature learning at
different stages in the proposed framework. Experiments are performed on four
few-shot action recognition benchmarks: Kinetics, SSv2, HMDB51 and UCF101. Our
extensive ablation study reveals the benefits of the proposed contributions.
Furthermore, our approach sets a new state-of-the-art on all four benchmarks.
On the challenging SSv2 benchmark, our approach achieves an absolute gain of
3.5% in classification accuracy, as compared to the best existing method in the
literature. Our code and models will be publicly released.
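The abstract's pipeline (local patch-level enrichment, global frame-level enrichment, then a query-class similarity classifier) can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: function names and shapes are invented, and parameter-free dot-product attention stands in for the learned enrichment sub-modules STRM actually trains.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n, d) tokens; parameter-free scaled dot-product self-attention
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def enrich(features):
    # features: (T, P, D) -- T frames, P patches per frame, D-dim embeddings
    T, P, D = features.shape
    # local patch-level enrichment: aggregate spatial context within each frame
    patch = np.stack([self_attention(features[t]) for t in range(T)])
    # global frame-level enrichment: pool patches, then attend across frames
    # to encode the broad temporal context
    frame = self_attention(patch.mean(axis=1))  # (T, D)
    return patch, frame

def query_class_similarity(query_patch, prototypes):
    # cosine similarity between pooled patch-level query features and
    # per-class prototypes (prototypes: (C, D)); stands in for the
    # query-class similarity classifier described in the abstract
    q = query_patch.mean(axis=(0, 1))
    q = q / np.linalg.norm(q)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return p @ q  # one similarity score per class
```

In the real model the relational matching is performed between query and support sub-sequences on the enriched representations; the sketch only shows the data flow from patch features to class scores.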
Related papers
- Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition [14.97527336050901]
We propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR).
It incorporates a sequential perceiver adapter into the pre-training framework, to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings.
Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, outperforming the second-best competitors by large margins.
arXiv Detail & Related papers (2024-08-22T15:13:27Z) - SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection [2.0755366440393743]
Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD).
We introduce a novel Submodular Mutual Information Learning framework which adopts mutual information functions.
Our proposed approach generalizes to several existing approaches in FSOD, agnostic of the backbone architecture.
arXiv Detail & Related papers (2024-07-02T20:53:43Z) - Hierarchical Spatio-Temporal Representation Learning for Gait Recognition [6.877671230651998]
Gait recognition is a biometric technique that identifies individuals by their unique walking styles.
We propose a hierarchical spatio-temporal representation learning framework for extracting gait features from coarse to fine.
Our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.
arXiv Detail & Related papers (2023-07-19T09:30:00Z) - Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z) - Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.
We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations.
Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z) - CAD: Co-Adapting Discriminative Features for Improved Few-Shot Classification [11.894289991529496]
Few-shot classification is a challenging problem that aims to learn a model that can adapt to unseen classes given a few labeled samples.
Recent approaches pre-train a feature extractor, and then fine-tune for episodic meta-learning.
We propose a strategy to cross-attend and re-weight discriminative features for few-shot classification.
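The cross-attend-and-re-weight idea above can be sketched minimally as follows. All names and the specific weighting scheme are assumptions for illustration, not the CAD paper's actual formulation: query tokens attend to support tokens, and each query token is re-weighted by how well it agrees with its support-aligned context.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend_reweight(query_feats, support_feats):
    # query_feats: (Nq, D) query tokens; support_feats: (Ns, D) support tokens
    d = query_feats.shape[-1]
    # cross-attention from query to support
    attn = softmax(query_feats @ support_feats.T / np.sqrt(d))  # (Nq, Ns)
    aligned = attn @ support_feats  # support context aligned to each query token
    # re-weight query tokens by agreement with their aligned support context,
    # so mutually discriminative features dominate the pooled embedding
    weights = softmax((query_feats * aligned).sum(axis=-1))  # (Nq,)
    return (weights[:, None] * query_feats).sum(axis=0)      # (D,)
```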
arXiv Detail & Related papers (2022-03-25T06:14:51Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - MCDAL: Maximum Classifier Discrepancy for Active Learning [74.73133545019877]
Recent state-of-the-art active learning methods have mostly leveraged Generative Adversarial Networks (GAN) for sample acquisition.
We propose in this paper a novel active learning framework that we call Maximum Classifier Discrepancy for Active Learning (MCDAL).
In particular, we utilize two auxiliary classification layers that learn tighter decision boundaries by maximizing the discrepancies among them.
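The acquisition step this implies can be sketched as follows, under stated assumptions: the discrepancy measure (L1 distance between the two heads' predictive distributions) and function names are illustrative, not necessarily MCDAL's exact objective.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def acquisition_scores(logits_a, logits_b):
    # discrepancy between the two auxiliary classification heads, here the
    # L1 distance between their predictive distributions; high discrepancy
    # means the sample lies near a decision boundary and is informative
    pa, pb = softmax(logits_a), softmax(logits_b)
    return np.abs(pa - pb).sum(axis=-1)

def select_batch(logits_a, logits_b, k):
    # label the k unlabeled samples on which the two heads disagree the most
    return np.argsort(-acquisition_scores(logits_a, logits_b))[:k]
```

During training the heads would be pushed apart (maximizing their discrepancy) while the shared backbone is trained normally; the sketch covers only the sample-selection side.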
arXiv Detail & Related papers (2021-07-23T06:57:08Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.