Slow-Fast Visual Tempo Learning for Video-based Action Recognition
- URL: http://arxiv.org/abs/2202.12116v1
- Date: Thu, 24 Feb 2022 14:20:04 GMT
- Title: Slow-Fast Visual Tempo Learning for Video-based Action Recognition
- Authors: Yuanzhong Liu, Zhigang Tu, Hongyan Li, Chi Chen, Baoxin Li, Junsong
Yuan
- Abstract summary: Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to effectively extract action visual tempo from low-level, single-layer backbone features.
- Score: 78.3820439082979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action visual tempo characterizes the dynamics and the temporal scale of an
action, which is helpful to distinguish human actions that share high
similarities in visual dynamics and appearance. Previous methods capture the
visual tempo either by sampling raw videos with multiple rates, which requires
a costly multi-layer network to handle each rate, or by hierarchically sampling
backbone features, which relies heavily on high-level features that miss
fine-grained temporal dynamics. In this work, we propose a Temporal Correlation
Module (TCM), which can be easily embedded into the current action recognition
backbones in a plug-and-play manner, to effectively extract action visual tempo from
low-level, single-layer backbone features. Specifically, our TCM
contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and
a Temporal Attention Module (TAM). MTDM applies a correlation operation to
learn pixel-wise fine-grained temporal dynamics for both fast-tempo and
slow-tempo. TAM adaptively emphasizes expressive features and suppresses
inessential ones via analyzing the global information across various tempos.
Extensive experiments conducted on several action recognition benchmarks, e.g.
Something-Something V1 & V2, Kinetics-400, UCF-101, and HMDB-51, have
demonstrated that the proposed TCM improves the performance of existing
video-based action recognition models by a large margin. The
source code is publicly released at https://github.com/zphyix/TCM.
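To make the two components more concrete, below is a minimal PyTorch-style sketch of the idea in the abstract: a pixel-wise correlation between frame pairs sampled at a fast and a slow tempo (the MTDM idea), followed by a channel attention over the fused result (the TAM idea). All module names, tensor shapes, the displacement range, how the fast/slow frame pairs are chosen, and the fusion scheme are illustrative assumptions, not the released implementation; see the repository linked above for the authors' code.

```python
# Minimal sketch of the TCM idea described above (not the authors' code).
# Assumptions: backbone features of shape (N, T, C, H, W); correlation range,
# frame-pair selection, and attention design are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def frame_correlation(feats_a, feats_b, max_disp=3):
    """Pixel-wise correlation between two feature maps (N, C, H, W),
    restricted to a (2*max_disp+1)^2 neighbourhood."""
    n, c, h, w = feats_a.shape
    pad = F.pad(feats_b, [max_disp] * 4)
    outputs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + h, dx:dx + w]
            outputs.append((feats_a * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(outputs, dim=1)   # (N, (2*max_disp+1)^2, H, W)


class TemporalAttention(nn.Module):
    """Re-weights the fused tempo features using globally pooled statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (N, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))   # (N, C) global context
        return x * w[:, :, None, None]


class TCMSketch(nn.Module):
    """Fast tempo: correlation between adjacent frames.
    Slow tempo: correlation between temporally strided frames."""
    def __init__(self, corr_channels=49, out_channels=64):
        super().__init__()
        self.proj = nn.Conv2d(2 * corr_channels, out_channels, kernel_size=1)
        self.attn = TemporalAttention(out_channels)

    def forward(self, feats):             # feats: (N, T, C, H, W)
        n, t, c, h, w = feats.shape
        fast = frame_correlation(feats[:, 0], feats[:, 1])              # adjacent pair
        slow = frame_correlation(feats[:, 0], feats[:, min(t - 1, 4)])  # strided pair
        fused = self.proj(torch.cat([fast, slow], dim=1))
        return self.attn(fused)


if __name__ == "__main__":
    x = torch.randn(2, 8, 64, 14, 14)
    print(TCMSketch()(x).shape)           # torch.Size([2, 64, 14, 14])
```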
Related papers
- DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [70.37707797523723]
We build a novel dynamic feature aggregation (DFA) module that can adapt kernel weights and receptive fields at different timestamps.
Using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters.
DyFADet, a new dynamic TAD model, achieves promising performance on a series of challenging TAD benchmarks.
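As a rough illustration of "adapting kernel weights at different timestamps", here is a hedged PyTorch sketch of a dynamic temporal convolution whose window weights are predicted from the features themselves. It is not the DyFADet implementation, receptive-field adaptation is not shown, and all layer sizes are assumptions.

```python
# Illustrative sketch of per-timestep dynamic kernel weights (not DyFADet code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicTemporalConv(nn.Module):
    """Predicts a weight per window position at every timestep and applies it
    to a local temporal window of the features."""
    def __init__(self, channels, window=3):
        super().__init__()
        self.window = window
        self.kernel_gen = nn.Conv1d(channels, window, kernel_size=3, padding=1)

    def forward(self, x):                                  # x: (N, C, T)
        n, c, t = x.shape
        kernels = torch.softmax(self.kernel_gen(x), dim=1)  # (N, window, T)
        # Unfold local temporal windows: (N, C, window, T).
        windows = F.unfold(x.unsqueeze(-1), kernel_size=(self.window, 1),
                           padding=(self.window // 2, 0)).view(n, c, self.window, t)
        return (windows * kernels.unsqueeze(1)).sum(dim=2)  # (N, C, T)


if __name__ == "__main__":
    feats = torch.randn(2, 128, 64)                         # (batch, channels, time)
    print(DynamicTemporalConv(128)(feats).shape)            # torch.Size([2, 128, 64])
```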
arXiv Detail & Related papers (2024-07-03T15:29:10Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z)
- Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks.
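The temporal-gradient modality itself is straightforward to compute as frame-wise differences; a minimal sketch follows, where the clip shape and how the extra modality is consumed are assumptions, not the paper's training setup.

```python
# Minimal sketch of temporal gradient (frame differences) as an extra modality.
import torch


def temporal_gradient(clip: torch.Tensor) -> torch.Tensor:
    """clip: (N, T, C, H, W) RGB frames -> (N, T-1, C, H, W) frame differences."""
    return clip[:, 1:] - clip[:, :-1]


if __name__ == "__main__":
    rgb = torch.rand(2, 16, 3, 112, 112)
    tg = temporal_gradient(rgb)      # e.g. fed to a second branch or used as a target
    print(tg.shape)                  # torch.Size([2, 15, 3, 112, 112])
```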
arXiv Detail & Related papers (2021-11-25T20:30:30Z)
- TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions.
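A hedged sketch of the "group of separate 1D convolutions" idea: channels are split into groups, and each group gets a depthwise temporal convolution at a different scale. The kernel sizes and the grouping below are assumptions, not the TSI code.

```python
# Illustrative multi-scale temporal modeling with a group of separate 1D convs.
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        group_c = channels // len(kernel_sizes)
        # One depthwise temporal conv per channel group, each with its own scale.
        self.branches = nn.ModuleList(
            nn.Conv1d(group_c, group_c, k, padding=k // 2, groups=group_c)
            for k in kernel_sizes)

    def forward(self, x):                             # x: (N, C, T)
        chunks = torch.chunk(x, len(self.branches), dim=1)
        return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)


if __name__ == "__main__":
    feats = torch.randn(2, 256, 8)                    # (batch, channels, frames)
    print(MultiScaleTemporalConv(256)(feats).shape)   # torch.Size([2, 256, 8])
```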
arXiv Detail & Related papers (2021-06-02T11:43:49Z)
- Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos.
We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors temporal activations with potential variations.
We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
- TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The Temporal Adaptive Module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
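Below is a rough sketch of generating a video-specific temporal kernel from the feature map and using it to re-weight the features. The paper's actual two-branch design is not reproduced, and every layer choice here is an assumption.

```python
# Rough sketch of a video-specific temporal kernel (loosely inspired by the
# TAM idea; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoSpecificTemporalConv(nn.Module):
    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, kernel_size, 1),
            nn.Softmax(dim=1))

    def forward(self, x):                          # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        ctx = x.mean(dim=(3, 4))                   # (N, C, T) spatially pooled context
        kernel = self.kernel_gen(ctx).mean(dim=2)  # (N, K): one temporal kernel per video
        # Apply the same per-video kernel to every channel via grouped conv1d.
        weight = kernel.repeat_interleave(c, dim=0).view(n * c, 1, self.k)
        out = F.conv1d(ctx.reshape(1, n * c, t), weight,
                       padding=self.k // 2, groups=n * c).view(n, c, t)
        return x * out.sigmoid()[:, :, :, None, None]   # re-weight the features


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)
    print(VideoSpecificTemporalConv(64)(feats).shape)   # torch.Size([2, 64, 8, 14, 14])
```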
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
- Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
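A toy sketch of a feature-level temporal pyramid follows: one feature tensor is pooled at several temporal rates and the levels are fused back at a common length. This only illustrates the general idea of aggregating multiple visual tempos at the feature level; the actual TPN operates on features from several backbone stages with dedicated modulation and fusion modules.

```python
# Toy feature-level temporal pyramid (not the TPN architecture); sizes assumed.
import torch
import torch.nn.functional as F


def temporal_pyramid(feats, rates=(1, 2, 4)):
    """feats: (N, C, T) -> fused (N, C, T) aggregating several visual tempos."""
    n, c, t = feats.shape
    levels = []
    for r in rates:
        pooled = F.avg_pool1d(feats, kernel_size=r, stride=r)   # coarser tempo
        levels.append(F.interpolate(pooled, size=t, mode="linear",
                                    align_corners=False))
    return torch.stack(levels).mean(dim=0)


if __name__ == "__main__":
    x = torch.randn(2, 256, 16)
    print(temporal_pyramid(x).shape)      # torch.Size([2, 256, 16])
```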
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
- TEA: Temporal Excitation and Aggregation for Action Recognition [31.076707274791957]
We propose a Temporal Excitation and Aggregation block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module.
For short-range motion modeling, the ME module calculates feature-level temporal differences from spatiotemporal features.
The MTA module deforms the local convolution into a group of sub-convolutions, forming a hierarchical residual architecture.
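A hedged sketch of motion excitation via feature-level temporal differences is shown below; the reduction ratio, pooling, and gating details are illustrative assumptions rather than the TEA implementation, and the MTA sub-convolution branch is not sketched.

```python
# Illustrative motion excitation from feature-level temporal differences
# (in the spirit of the ME module; not the TEA code).
import torch
import torch.nn as nn


class MotionExcitation(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        red = channels // reduction
        self.squeeze = nn.Conv2d(channels, red, 1)
        self.expand = nn.Conv2d(red, channels, 1)

    def forward(self, x):                       # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f = self.squeeze(x.flatten(0, 1)).view(n, t, -1, h, w)
        diff = f[:, 1:] - f[:, :-1]              # feature-level temporal differences
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        pooled = diff.flatten(0, 1).mean(dim=(2, 3), keepdim=True)   # (N*T, C/r, 1, 1)
        attn = torch.sigmoid(self.expand(pooled)).view(n, t, c, 1, 1)
        return x + x * attn                      # excite motion-sensitive channels


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)
    print(MotionExcitation(64)(feats).shape)     # torch.Size([2, 8, 64, 14, 14])
```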
arXiv Detail & Related papers (2020-04-03T06:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.