Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2301.07944v2
- Date: Sat, 8 Apr 2023 03:29:05 GMT
- Title: Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition
- Authors: Jiazheng Xing, Mengmeng Wang, Yong Liu, Boyu Mu
- Abstract summary: We propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner.
We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51.
- Score: 16.287968292213563
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Spatial and temporal modeling is one of the core aspects of
few-shot action recognition. Most previous works focus mainly on long-term
temporal relation modeling based on high-level spatial representations,
without considering the crucial low-level spatial features and short-term
temporal relations. In fact, the former can provide rich local semantic
information, while the latter can capture the motion characteristics of
adjacent frames. In this paper, we propose SloshNet, a new
framework that revisits the spatial and temporal modeling for few-shot action
recognition in a finer manner. First, to exploit the low-level spatial
features, we design a feature fusion architecture search module to
automatically search for the best combination of the low-level and high-level
spatial features. Next, inspired by the recent success of transformers, we introduce a
long-term temporal modeling module to model the global temporal relations based
on the extracted spatial appearance features. Meanwhile, we design another
short-term temporal modeling module to encode the motion characteristics
between adjacent frame representations. After that, the final predictions can
be obtained by feeding the embedded rich spatial-temporal features to a common
frame-level class prototype matcher. We extensively validate the proposed
SloshNet on four few-shot action recognition datasets, including
Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable
results against state-of-the-art methods on all datasets.
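To make the described pipeline concrete, below is a minimal PyTorch sketch of its components: a soft feature-fusion selection, a transformer-style long-term temporal module, a short-term module built on adjacent-frame differences, and a frame-level prototype matcher. All names, shapes, and design details here are illustrative assumptions drawn only from the abstract, not the authors' implementation.

```python
# Hedged sketch of a SloshNet-style pipeline; every module below is an
# illustrative assumption based only on the abstract, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionSearch(nn.Module):
    """DARTS-like soft selection over low-/high-level spatial features (assumed)."""
    def __init__(self, num_levels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_levels))  # architecture weights

    def forward(self, feats):  # feats: list of (B, T, D) tensors, one per level
        w = self.alpha.softmax(dim=0)
        return sum(wi * f for wi, f in zip(w, feats))

class LongTermTemporalModule(nn.Module):
    """Global temporal relations via self-attention over all frames."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):          # x: (B, T, D) frame-level features
        h, _ = self.attn(x, x, x)  # every frame attends to every other frame
        return self.norm(x + h)

class ShortTermTemporalModule(nn.Module):
    """Motion cues from adjacent-frame feature differences."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, D)
        diff = x[:, 1:] - x[:, :-1]       # differences between adjacent frames
        diff = F.pad(diff, (0, 0, 0, 1))  # pad back to T steps
        return x + self.proj(diff)

def prototype_match(support, query):
    """Frame-level class prototype matching (1-shot case, cosine similarity)."""
    # support: (N_class, T, D) one embedded video per class; query: (T, D)
    sims = F.cosine_similarity(query.unsqueeze(0), support, dim=-1)  # (N_class, T)
    return sims.mean(dim=-1)  # per-class score; argmax gives the prediction

# Toy usage on random features standing in for backbone outputs.
B, T, D = 2, 8, 256
low, high = torch.randn(B, T, D), torch.randn(B, T, D)
x = FeatureFusionSearch(2)([low, high])
x = ShortTermTemporalModule(D)(LongTermTemporalModule(D)(x))
scores = prototype_match(torch.randn(5, T, D), x[0])  # 5-way scores for one query
```

The softmax-weighted fusion is the standard continuous relaxation used in differentiable architecture search; whether SloshNet searches over levels exactly this way is an assumption.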
Related papers
- ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization [62.751303924391564]
Effectively exploring spatial-temporal features is important for video colorization.
We develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames.
We develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood (a rough sketch follows this entry).
arXiv Detail & Related papers (2024-04-09T12:23:30Z)
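A minimal sketch of such local spatial-temporal attention, assuming each current-frame location attends to a k x k neighborhood in an adjacent frame; this is my own illustration of the idea, not the ColorMNet module.

```python
# Illustrative local attention over an adjacent frame's k x k neighborhood;
# an assumption based on the summary above, not the actual ColorMNet code.
import torch
import torch.nn.functional as F

def local_temporal_attention(cur, adj, k=3):
    # cur, adj: (B, C, H, W) features of the current and an adjacent frame
    B, C, H, W = cur.shape
    # Gather the k*k spatial neighbors of every location from the adjacent frame.
    nbrs = F.unfold(adj, k, padding=k // 2)            # (B, C*k*k, H*W)
    nbrs = nbrs.view(B, C, k * k, H * W)
    q = cur.view(B, C, 1, H * W)                       # current frame as the query
    attn = (q * nbrs).sum(1, keepdim=True) / C ** 0.5  # dot-product similarities
    attn = attn.softmax(dim=2)                         # normalize over k*k neighbors
    out = (attn * nbrs).sum(dim=2)                     # attention-weighted aggregation
    return out.view(B, C, H, W)

# Toy usage: aggregate features from one adjacent frame.
cur, adj = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
fused = local_temporal_attention(cur, adj)             # (1, 64, 32, 32)
```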
- A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with lower computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z)
- Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition [10.60209288486904]
Current methods in skeleton-based action recognition typically focus on capturing long-term temporal dependencies.
We propose a general framework, coined as STGAT, to model cross-spacetime information flow.
STGAT achieves state-of-the-art performance on three large-scale datasets.
arXiv Detail & Related papers (2022-08-18T02:34:46Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
- Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through tightly coupled multimodal spatiotemporal representations.
We propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z)
- TEA: Temporal Excitation and Aggregation for Action Recognition [31.076707274791957]
We propose a Temporal Excitation and Aggregation block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module.
For short-range motion modeling, the ME module calculates feature-level temporal differences from spatiotemporal features (a brief sketch follows this entry).
The MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture.
arXiv Detail & Related papers (2020-04-03T06:53:30Z)
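The temporal-difference idea behind the ME module can be sketched as a simple channel-gating block; this simplified version is my own assumption from the summary, not the TEA implementation.

```python
# Simplified motion-excitation-style block: adjacent-frame feature differences
# gate the channels of each frame. An assumption, not the authors' TEA code.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        r = channels // reduction
        self.squeeze = nn.Conv2d(channels, r, 1)  # channel reduction
        self.expand = nn.Conv2d(r, channels, 1)   # channel restoration
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):  # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        f = self.squeeze(x.flatten(0, 1)).view(B, T, -1, H, W)
        diff = f[:, 1:] - f[:, :-1]                    # feature-level temporal differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to T steps
        a = self.pool(diff.flatten(0, 1))              # spatially pooled motion descriptor
        a = torch.sigmoid(self.expand(a)).view(B, T, C, 1, 1)
        return x * a                                   # motion-conditioned channel gating

# Toy usage on random clip features.
clip = torch.randn(2, 8, 64, 14, 14)
out = MotionExcitation(64)(clip)  # same shape, motion-excited
```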
- Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)