TSM: Temporal Shift Module for Efficient and Scalable Video
Understanding on Edge Device
- URL: http://arxiv.org/abs/2109.13227v1
- Date: Mon, 27 Sep 2021 17:59:39 GMT
- Title: TSM: Temporal Shift Module for Efficient and Scalable Video
Understanding on Edge Device
- Authors: Ji Lin, Chuang Gan, Kuan Wang, Song Han
- Abstract summary: We propose a generic and effective Temporal Shift Module (TSM) that enjoys both high efficiency and high performance.
TSM is inserted into 2D CNNs to achieve temporal modeling at zero computation and zero parameters.
It achieves a high frame rate of 74 fps and 29 fps for online video recognition on Jetson Nano and Galaxy Note8.
- Score: 58.776352999540435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The explosive growth in video streaming requires video understanding at high
accuracy and low computation cost. Conventional 2D CNNs are computationally
cheap but cannot capture temporal relationships; 3D CNN-based methods can
achieve good performance but are computationally intensive. In this paper, we
propose a generic and effective Temporal Shift Module (TSM) that enjoys both
high efficiency and high performance. The key idea of TSM is to shift part of
the channels along the temporal dimension, thus facilitating information
exchange among neighboring frames. It can be inserted into 2D CNNs to achieve
temporal modeling at zero computation and zero parameters. TSM offers several
unique advantages. Firstly, TSM has high performance; it ranks first on the
Something-Something leaderboard upon submission. Secondly, TSM has high
efficiency; it achieves a high frame rate of 74 fps and 29 fps for online video
recognition on Jetson Nano and Galaxy Note8. Thirdly, TSM has higher
scalability compared to 3D networks, enabling large-scale Kinetics training on
1,536 GPUs in 15 minutes. Lastly, TSM enables action concepts learning, which
2D networks cannot model; we visualize the category attention map and find that
a spatial-temporal action detector emerges during the training of classification
tasks. The code is publicly available at
https://github.com/mit-han-lab/temporal-shift-module.
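As a concrete picture of the shift described above, here is a minimal PyTorch sketch of the temporal shift operation, assuming activations reshaped to (N, T, C, H, W); the 1/8 shift fraction and forward/backward split follow the partial-shift idea, but the released code at the URL above is the authoritative version:

```python
import torch

def temporal_shift(x, shift_div=8):
    # x: (N, T, C, H, W) activations with the temporal dimension unfolded.
    # 1/shift_div of the channels are shifted one step backward in time,
    # another 1/shift_div one step forward; the rest stay put. The shift
    # is pure memory movement: zero parameters, zero FLOPs.
    n, t, c, h, w = x.size()
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift toward the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift toward the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out
```

Because the shift is pure memory movement, it adds no parameters and no FLOPs, which is what lets TSM ride on a plain 2D backbone.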
Related papers
- What Can Simple Arithmetic Operations Do for Temporal Modeling? [100.39047523315662]
Temporal modeling plays a crucial role in understanding video content.
Previous studies have built increasingly complicated temporal relations across the time sequence, enabled by ever more powerful hardware.
In this work, we explore the potential of four simple arithmetic operations for temporal modeling.
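The summary does not say which four operations are studied; as a hedged illustration only, subtraction between neighboring frames (a temporal difference) is about the simplest arithmetic operation that injects temporal information:

```python
import torch

def temporal_difference(x):
    # x: (N, T, C, H, W) per-frame features. Subtracting neighboring frames
    # highlights motion-like change at essentially zero cost.
    diff = x[:, 1:] - x[:, :-1]              # (N, T-1, C, H, W)
    pad = torch.zeros_like(x[:, :1])         # zero-pad back to length T
    return torch.cat([diff, pad], dim=1)
```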
arXiv Detail & Related papers (2023-07-18T00:48:56Z)
- Gate-Shift-Fuse for Video Action Recognition [43.8525418821458]
Gate-Shift-Fuse (GSF) is a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner.
GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead.
We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
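As a hedged sketch of the routing idea only (not the actual GSF architecture), one can picture a learned spatial gate blending temporally shifted features with the unshifted ones; `temporal_shift` refers to the TSM sketch above:

```python
import torch
import torch.nn as nn

class GatedShift(nn.Module):
    # Hypothetical illustration of data-dependent routing through time:
    # a sigmoid gate decides, per channel and location, how much of the
    # temporally shifted signal to keep versus the unshifted one.
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                       # x: (N, T, C, H, W)
        n, t, c, h, w = x.size()
        shifted = temporal_shift(x)             # from the TSM sketch above
        g = torch.sigmoid(self.gate(x.reshape(n * t, c, h, w)))
        g = g.reshape(n, t, c, h, w)
        return g * shifted + (1 - g) * x        # fuse in a data-dependent way
```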
arXiv Detail & Related papers (2022-03-16T19:19:04Z)
- STSM: Spatio-Temporal Shift Module for Efficient Action Recognition [4.096670184726871]
We propose a plug-and-play Spatio-Temporal Shift Module (STSM) that is both effective and high-performing.
In particular, when the backbone is a 2D CNN, STSM allows the network to learn efficient spatio-temporal features.
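A hedged sketch of the spatio-temporal shift idea (an illustration, not the paper's exact design), extending the temporal shift above with analogous shifts along height and width:

```python
import torch

def spatiotemporal_shift(x, shift_div=8):
    # Hypothetical illustration: shift disjoint channel groups along the
    # temporal, height, and width axes so a 2D backbone sees displaced
    # context in all three dimensions. x: (N, T, C, H, W).
    out = temporal_shift(x, shift_div)       # temporal groups, as in TSM
    fold = x.size(2) // shift_div
    out[:, :, 3 * fold:4 * fold] = torch.roll(x[:, :, 3 * fold:4 * fold], 1, dims=3)  # shift along H
    out[:, :, 4 * fold:5 * fold] = torch.roll(x[:, :, 4 * fold:5 * fold], 1, dims=4)  # shift along W
    return out
```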
arXiv Detail & Related papers (2021-12-05T09:40:49Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
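A hedged sketch of the separable-convolution idea (the MVF module's real structure may differ): one cheap single-axis convolution per view (temporal, height, width), fused by summation:

```python
import torch.nn as nn

class MultiViewConv(nn.Module):
    # Hypothetical sketch: three depthwise convolutions, each with a kernel
    # extending along a single axis (temporal, height, width), summed as a
    # cheap separable substitute for a full 3D convolution.
    def __init__(self, channels):
        super().__init__()
        self.t = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0), groups=channels)
        self.h = nn.Conv3d(channels, channels, (1, 3, 1), padding=(0, 1, 0), groups=channels)
        self.w = nn.Conv3d(channels, channels, (1, 1, 3), padding=(0, 0, 1), groups=channels)

    def forward(self, x):                    # x: (N, C, T, H, W)
        return x + self.t(x) + self.h(x) + self.w(x)
```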
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- 3D CNNs with Adaptive Temporal Feature Resolutions [83.43776851586351]
The Similarity Guided Sampling (SGS) module can be plugged into any existing 3D CNN architecture.
SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy.
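As a hedged illustration of similarity-guided grouping (not SGS's actual sampling rule), one can merge runs of near-duplicate neighboring frames so that redundant clips end up with a lower temporal resolution:

```python
import torch
import torch.nn.functional as F

def group_similar_frames(x, threshold=0.9):
    # Hypothetical illustration: average runs of neighboring frames whose
    # pooled features are nearly identical, adaptively reducing the
    # temporal resolution. x: (N, T, C) per-frame features.
    groups, current = [], [x[:, 0]]
    for t in range(1, x.size(1)):
        sim = F.cosine_similarity(x[:, t], current[-1], dim=-1).mean()
        if sim > threshold:
            current.append(x[:, t])
        else:
            groups.append(torch.stack(current, dim=1).mean(dim=1))
            current = [x[:, t]]
    groups.append(torch.stack(current, dim=1).mean(dim=1))
    return torch.stack(groups, dim=1)        # (N, T', C) with T' <= T
```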
arXiv Detail & Related papers (2020-11-17T14:34:05Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
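A hedged sketch of a factorized temporal bilinear interaction (an assumption, not the paper's module): project neighboring frames and multiply elementwise, which approximates a bilinear form between them at low rank:

```python
import torch.nn as nn

class TemporalBilinear(nn.Module):
    # Hypothetical low-rank bilinear interaction between neighboring frames:
    # two linear projections followed by an elementwise product approximate
    # x_t^T W x_{t+1} with factorized weights.
    def __init__(self, channels, rank=64):
        super().__init__()
        self.p = nn.Linear(channels, rank)
        self.q = nn.Linear(channels, rank)

    def forward(self, x):                    # x: (N, T, C) pooled features
        a = self.p(x[:, :-1])                # (N, T-1, rank)
        b = self.q(x[:, 1:])                 # (N, T-1, rank)
        return a * b                         # pairwise temporal interactions
```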
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)
- STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid (STH) network which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
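A hedged sketch of a channel-split hybrid convolution (an illustration, not STH's actual operator): part of the channels receives a spatial kernel and the rest a temporal kernel, so both cues are encoded within roughly a 2D parameter budget:

```python
import torch
import torch.nn as nn

class HybridConv(nn.Module):
    # Hypothetical illustration: split the channels so one part is
    # processed by a spatial (1x3x3) kernel and the other by a temporal
    # (3x1x1) kernel, keeping roughly a 2D-CNN parameter budget.
    def __init__(self, channels, split=0.75):
        super().__init__()
        self.cs = int(channels * split)      # channels given the spatial kernel
        self.ct = channels - self.cs         # channels given the temporal kernel
        self.spatial = nn.Conv3d(self.cs, self.cs, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(self.ct, self.ct, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                    # x: (N, C, T, H, W)
        xs, xt = x.split([self.cs, self.ct], dim=1)
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)
```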
arXiv Detail & Related papers (2020-03-18T04:46:30Z)