What Can Simple Arithmetic Operations Do for Temporal Modeling?
- URL: http://arxiv.org/abs/2307.08908v2
- Date: Tue, 22 Aug 2023 14:10:06 GMT
- Title: What Can Simple Arithmetic Operations Do for Temporal Modeling?
- Authors: Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang
- Abstract summary: Temporal modeling plays a crucial role in understanding video content.
Previous studies built complicated temporal relations over the time sequence, enabled by increasingly powerful computing devices.
In this work, we explore the potential of four simple arithmetic operations for temporal modeling.
- Score: 100.39047523315662
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies built complicated temporal relations over the time sequence, enabled by the development of computationally powerful devices. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing the addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original, temporally agnostic features. We term this simple pipeline the Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone in a plug-and-play style. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost. Moreover, the ATM is compatible with both CNN- and ViT-based architectures. Our results show that ATM achieves superior performance on several popular video benchmarks. Specifically, on Something-Something V1, V2, and Kinetics-400, we reach top-1 accuracies of 65.6%, 74.6%, and 89.4%, respectively. The code is available at https://github.com/whwu95/ATM.
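Below is a minimal PyTorch sketch of the arithmetic-cue idea described in the abstract: auxiliary temporal cues are computed as the four arithmetic operations between neighboring frame features and then projected back onto the per-frame features. The module layout, 1x1-convolution projections, epsilon-stabilized division, and residual fusion are illustrative assumptions, not the authors' implementation; see the linked repository for the real one.

```python
import torch
import torch.nn as nn

class ArithmeticTemporalSketch(nn.Module):
    """Illustrative sketch of the ATM idea: derive auxiliary temporal cues
    from simple arithmetic between neighboring frame features, then project
    those cues back onto the per-frame features. The layer choices here
    (1x1 convs, eps, residual fusion) are assumptions, not the paper's design."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps  # keeps the division cue numerically stable (our assumption)
        # One lightweight projection per arithmetic cue: add, sub, mul, div.
        self.proj = nn.ModuleList([nn.Conv2d(channels, channels, 1) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) frame features.
        b, t, c, h, w = x.shape
        cur, nxt = x[:, :-1], x[:, 1:]  # neighboring frame pairs
        cues = [cur + nxt, cur - nxt, cur * nxt, cur / (nxt + self.eps)]
        fused = 0
        for cue, proj in zip(cues, self.proj):
            fused = fused + proj(cue.reshape(b * (t - 1), c, h, w))
        fused = fused.reshape(b, t - 1, c, h, w)
        # Residual fusion onto the original (temporally agnostic) features;
        # the last frame is left unchanged for shape compatibility.
        out = x.clone()
        out[:, :-1] = out[:, :-1] + fused
        return out

# Usage: y = ArithmeticTemporalSketch(64)(torch.randn(2, 8, 64, 14, 14))
```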
Related papers
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding.
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features within a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device [58.776352999540435]
We propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
TSM is inserted into 2D CNNs to achieve temporal modeling at zero extra computation and zero extra parameters; a minimal sketch of the channel-shift idea appears after this list.
It achieves high frame rates of 74 fps on Jetson Nano and 29 fps on Galaxy Note8 for online video recognition.
arXiv Detail & Related papers (2021-09-27T17:59:39Z)
- SSAN: Separable Self-Attention Network for Video Representation Learning [11.542048296046524]
We propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially.
By adding the SSA module to a 2D CNN, we build a separable self-attention network (SSAN) for video representation learning.
Our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets.
arXiv Detail & Related papers (2021-05-27T10:02:04Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted into temporal bilinear modules by adding an auxiliary sampling branch.
Our models outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
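As referenced in the TSM entry above, here is a minimal sketch of the parameter-free temporal-shift operation: a small fraction of channels is shifted one step forward or backward along the time axis, which gives a 2D CNN a temporal receptive field at no extra parameter cost. The tensor layout and the 1/8 shift fraction are assumptions for illustration; the fraction is a tunable hyperparameter in the paper's ablations.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Parameter-free temporal shift in the spirit of TSM.

    x: (batch, time, channels, height, width) feature map.
    shift_div: 1/shift_div of the channels is shifted in each direction.
    """
    fold = x.size(2) // shift_div  # number of channels shifted each way
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels: identity
    return out

# Usage: y = temporal_shift(torch.randn(2, 8, 64, 14, 14))
```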