TSI: Temporal Saliency Integration for Video Action Recognition
- URL: http://arxiv.org/abs/2106.01088v1
- Date: Wed, 2 Jun 2021 11:43:49 GMT
- Title: TSI: Temporal Saliency Integration for Video Action Recognition
- Authors: Haisheng Su, Jinyuan Feng, Dongliang Wang, Weihao Gan, Wei Wu, Yu Qiao
- Abstract summary: We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
- Score: 32.18535820790586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient spatiotemporal modeling is an important yet challenging problem for
video action recognition. Existing state-of-the-art methods exploit motion
clues to assist in short-term temporal modeling through temporal difference
over consecutive frames. However, background noise is inevitably introduced
due to camera movement. Besides, the movements of different actions can vary
greatly. In this paper, we propose a Temporal Saliency Integration
(TSI) block, which mainly contains a Salient Motion Excitation (SME) module and
a Cross-scale Temporal Integration (CTI) module. Specifically, SME aims to
highlight the motion-sensitive area through local-global motion modeling, where
the background suppression and pyramidal feature difference are conducted
successively between neighboring frames to capture motion dynamics with less
background noise. CTI is designed to perform multi-scale temporal modeling
through a group of separate 1D convolutions. Meanwhile, temporal
interactions across different scales are integrated with an attention mechanism.
Through these two modules, long short-term temporal relationships can be
encoded efficiently by introducing limited additional parameters. Extensive
experiments are conducted on several popular benchmarks (i.e.,
Something-Something v1 & v2, Kinetics-400, UCF-101, and HMDB-51), which
demonstrate the effectiveness and superiority of our proposed method.
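To make the described mechanism concrete, below is a minimal PyTorch sketch of the two ideas from the abstract: an SME-style gate built from background-suppressed differences between neighboring frames, and a CTI-style fusion of separate 1D temporal convolutions at several scales weighted by attention. The module names, tensor shapes, reduction ratio, and kernel sizes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SalientMotionExcitation(nn.Module):
    """Gate spatial positions by background-suppressed neighboring-frame differences."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        feat = self.squeeze(x.flatten(0, 1)).view(n, t, -1, h, w)
        # Crude "background suppression" (assumption): subtract the per-frame
        # global mean so camera-wide shifts contribute less to the differences.
        feat = feat - feat.mean(dim=(3, 4), keepdim=True)
        diff = feat[:, 1:] - feat[:, :-1]                    # neighboring-frame differences
        diff = F.pad(diff, (0, 0, 0, 0, 0, 0, 0, 1))         # pad the last frame to keep T
        gate = torch.sigmoid(self.expand(diff.flatten(0, 1))).view(n, t, c, h, w)
        return x * gate + x                                  # residual motion excitation


class CrossScaleTemporalIntegration(nn.Module):
    """Separate depthwise 1D temporal convolutions per scale, fused by attention."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.attn = nn.Linear(channels, len(kernel_sizes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W) -> 1D convolution over T at every spatial position
        n, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        outs = torch.stack([b(seq) for b in self.branches])            # (S, N*H*W, C, T)
        outs = outs.view(len(self.branches), n, h * w, c, t)
        weights = torch.softmax(self.attn(x.mean(dim=(1, 3, 4))), -1)  # (N, S) scale attention
        weights = weights.t()[:, :, None, None, None]                  # (S, N, 1, 1, 1)
        fused = (outs * weights).sum(dim=0)                            # (N, H*W, C, T)
        return fused.view(n, h, w, c, t).permute(0, 4, 3, 1, 2)        # back to (N, T, C, H, W)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)                    # (batch, frames, channels, H, W)
    out = CrossScaleTemporalIntegration(64)(SalientMotionExcitation(64)(feats))
    print(out.shape)                                          # torch.Size([2, 8, 64, 14, 14])
```

As in the abstract, both parts operate on features of an existing 2D backbone and add only a small number of parameters on top of it.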
Related papers
- A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with lower computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
- ProgressiveMotionSeg: Mutually Reinforced Framework for Event-Based Motion Segmentation [101.19290845597918]
This paper presents a Motion Estimation (ME) module and an Event Denoising (ED) module jointly optimized in a mutually reinforced manner.
Taking temporal correlation as guidance, ED module calculates the confidence that each event belongs to real activity events, and transmits it to ME module to update energy function of motion segmentation for noise suppression.
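Reading the summary literally, one hedged sketch of the loop looks like this: score each event by spatio-temporal support from its neighbors (an ED-like confidence), then use those scores as per-event weights in a simple contrast-maximization energy for motion estimation (an ME-like update). The O(E^2) neighbor test, thresholds, and energy below are illustrative assumptions, not the paper's formulation.

```python
import torch


def event_confidence(timestamps: torch.Tensor, xy: torch.Tensor,
                     radius: float = 3.0, tau: float = 5e-3) -> torch.Tensor:
    """Score each event by how many spatio-temporal neighbors support it.

    Quadratic in the number of events; intended only for small event windows.
    """
    dt = (timestamps[:, None] - timestamps[None, :]).abs()           # (E, E)
    dxy = (xy[:, None, :] - xy[None, :, :]).norm(dim=-1)             # (E, E)
    support = ((dt < tau) & (dxy < radius)).float().sum(dim=1) - 1   # exclude self
    return torch.sigmoid(support - support.median())                 # soft confidence in [0, 1]


def weighted_contrast_energy(xy: torch.Tensor, timestamps: torch.Tensor,
                             flow: torch.Tensor, conf: torch.Tensor,
                             size=(180, 240)) -> torch.Tensor:
    """Negative variance of the confidence-weighted image of warped events."""
    warped = xy - timestamps[:, None] * flow                         # warp by one flow candidate
    ix = warped[:, 0].round().clamp(0, size[1] - 1).long()
    iy = warped[:, 1].round().clamp(0, size[0] - 1).long()
    img = torch.zeros(size)
    img.index_put_((iy, ix), conf, accumulate=True)                  # noisy events count less
    return -img.var()                                                # lower energy = sharper image
```

A motion candidate would then be chosen by minimizing `weighted_contrast_energy` over flow hypotheses and the confidences recomputed, mirroring the mutual-reinforcement loop the summary describes.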
arXiv Detail & Related papers (2022-03-22T13:40:26Z)
- Behavior Recognition Based on the Integration of Multigranular Motion Features [17.052997301790693]
We propose a novel behavior recognition method based on the integration of multigranular (IMG) motion features.
We evaluate our model on several action recognition benchmarks such as HMDB51, Something-Something and UCF101.
arXiv Detail & Related papers (2022-03-07T02:05:26Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features in a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Long-Short Temporal Modeling for Efficient Action Recognition [32.159784061961886]
We propose a new two-stream action recognition network, termed MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module.
For short-term motions, we design an efficient ME module that enhances them by mingling the motion saliency among neighboring segments.
As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments.
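A minimal sketch of how such a two-part design could look on segment-level features of shape (N, S, C): neighboring-segment differences mixed back into each segment as short-term motion saliency (ME-like), and attention pooling over all segments for video-level aggregation (VLA-like). All names and operations here are assumptions, not the MENet code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionEnhancementSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, S, C) segment-level features
        diff = x[:, 1:] - x[:, :-1]                          # neighboring-segment motion saliency
        diff = F.pad(diff, (0, 0, 0, 1))                     # keep S segments
        mingled = 0.5 * (diff + diff.roll(1, dims=1))        # mix saliency among neighbors
        return x + torch.tanh(self.proj(mingled))            # enhance the appearance features


class VideoLevelAggregationSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, S, C) -> (N, C), attention over all segments
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)
```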
arXiv Detail & Related papers (2021-06-30T02:54:13Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections while still suffering from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
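A hedged sketch of the two gates as summarized: a channel-wise gate driven by how much each channel's pooled statistics change over time (CME-like), and a spatial gate driven by per-position cosine similarity between adjacent feature maps (SME-like). The real modules learn their gates; the parameter-free versions below, along with shapes and gating functions, are simplifying assumptions.

```python
import torch
import torch.nn.functional as F


def channelwise_motion_gate(x: torch.Tensor) -> torch.Tensor:
    # x: (N, T, C, H, W); emphasize channels whose pooled statistics vary over time
    stats = x.mean(dim=(3, 4))                                  # (N, T, C)
    change = (stats[:, 1:] - stats[:, :-1]).abs().mean(dim=1)   # (N, C) temporal variation
    gate = torch.sigmoid(change - change.mean(dim=1, keepdim=True))
    return x * gate[:, None, :, None, None] + x                 # residual channel excitation


def spatialwise_motion_gate(x: torch.Tensor) -> torch.Tensor:
    # x: (N, T, C, H, W); low similarity between adjacent frames marks motion regions
    cur, nxt = x[:, :-1], x[:, 1:]
    sim = F.cosine_similarity(cur, nxt, dim=2)                  # (N, T-1, H, W), per position
    sim = F.pad(sim, (0, 0, 0, 0, 0, 1), value=1.0)             # last frame: assume no motion
    gate = (1.0 - sim).unsqueeze(2)                             # motion-sensitive spatial map
    return x * gate + x
```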
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- PAN: Towards Fast Action Recognition via Learning Persistence of Appearance [60.75488333935592]
Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA).
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
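A minimal sketch of a PA-style cue as summarized: the per-pixel magnitude of low-level feature differences between adjacent frames, which responds mostly at moving boundaries and requires no optical flow. The convolution and normalization choices are assumptions for illustration.

```python
import torch
import torch.nn as nn


class PersistenceOfAppearanceSketch(nn.Module):
    def __init__(self, in_channels: int = 3, feat_channels: int = 8):
        super().__init__()
        self.lowlevel = nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (N, T, 3, H, W) -> motion maps (N, T-1, 1, H, W)
        n, t, c, h, w = frames.shape
        feat = self.lowlevel(frames.flatten(0, 1)).view(n, t, -1, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                        # adjacent-frame feature difference
        pa = diff.pow(2).sum(dim=2, keepdim=True).sqrt()         # per-pixel magnitude
        return pa / (pa.amax(dim=(3, 4), keepdim=True) + 1e-6)   # normalize each map to [0, 1]
```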
arXiv Detail & Related papers (2020-08-08T07:09:54Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
- TEA: Temporal Excitation and Aggregation for Action Recognition [31.076707274791957]
We propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module.
For short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features.
The MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture.
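A hedged sketch of the MTA idea on channel-temporal features of shape (N, C, T): channels are split into groups of sub-convolutions chained residually, so later groups accumulate a progressively larger temporal receptive field. The group count and kernel size are assumptions, not the TEA configuration.

```python
import torch
import torch.nn as nn


class MultipleTemporalAggregationSketch(nn.Module):
    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 3):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        sub = channels // groups
        self.subconvs = nn.ModuleList(
            nn.Conv1d(sub, sub, kernel_size, padding=kernel_size // 2)
            for _ in range(groups - 1)                        # first group is passed through
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T) channel-temporal features (spatial dims already pooled or folded)
        chunks = list(x.chunk(self.groups, dim=1))
        outs = [chunks[0]]
        prev = chunks[0]
        for conv, chunk in zip(self.subconvs, chunks[1:]):
            prev = conv(chunk + prev)                         # hierarchical residual connection
            outs.append(prev)
        return torch.cat(outs, dim=1)
```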
arXiv Detail & Related papers (2020-04-03T06:53:30Z)