Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization
for Efficient Video Classification
- URL: http://arxiv.org/abs/2012.00317v3
- Date: Thu, 22 Apr 2021 01:40:41 GMT
- Title: Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization
for Efficient Video Classification
- Authors: Youngwan Lee, Hyung-Il Kim, Kimin Yun, Jinyoung Moon
- Abstract summary: We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D.
Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and requires 16x less computation while surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
- Score: 12.787763599624173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two directions in video classification research have recently
attracted attention: temporal modeling and efficient 3D architectures. However,
existing temporal modeling methods are not efficient, and efficient 3D
architectures pay little attention to temporal modeling. To bridge this gap, we
propose an efficient temporal modeling 3D architecture, called VoV3D, that
consists of a temporal one-shot aggregation (T-OSA) module and depthwise
factorized component, D(2+1)D. The T-OSA is devised to build a feature
hierarchy by aggregating temporal features with different temporal receptive
fields. Stacking this T-OSA enables the network itself to model short-range as
well as long-range temporal relationships across frames without any external
modules. Inspired by kernel factorization and channel factorization, we also
design a depthwise spatiotemporal factorization module, named D(2+1)D, that
decomposes a 3D depthwise convolution into a spatial depthwise convolution and
a temporal depthwise convolution, making our network more lightweight and
efficient. By using the proposed temporal modeling method (T-OSA) and the
efficient factorized component (D(2+1)D), we construct two types of VoV3D
networks, VoV3D-M and
VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L has
6x fewer model parameters and requires 16x less computation while surpassing a
state-of-the-art temporal modeling method on both Something-Something and
Kinetics-400. Furthermore, VoV3D shows better temporal modeling ability than
X3D, a state-of-the-art efficient 3D architecture with comparable model
capacity. We hope that VoV3D can serve as a baseline for efficient video
classification.
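To make the two components concrete, below is a minimal PyTorch sketch of the ideas described in the abstract: a D(2+1)D block that factorizes a 3D depthwise convolution into a 1x3x3 spatial and a 3x1x1 temporal depthwise convolution, and a T-OSA module that stacks several such blocks and aggregates all intermediate outputs once by concatenation. The number of stacked blocks, the BatchNorm/ReLU placement, and the 1x1x1 fusion convolution are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of D(2+1)D and T-OSA as described in the abstract.
# Block count, normalization, and the 1x1x1 fusion layer are assumptions.
import torch
import torch.nn as nn


class D2Plus1D(nn.Module):
    """Depthwise (2+1)D block: a 3D depthwise convolution factorized into a
    1x3x3 spatial depthwise conv followed by a 3x1x1 temporal depthwise conv
    (groups = channels), which is lighter than a full 3x3x3 depthwise kernel."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), groups=channels, bias=False)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), groups=channels, bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.temporal(self.spatial(x))))


class TOSA(nn.Module):
    """Temporal one-shot aggregation: stack several D(2+1)D blocks so that each
    successive output has a larger temporal receptive field, then concatenate
    the input and all intermediate outputs once and fuse them with a 1x1x1
    convolution."""

    def __init__(self, channels: int, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList([D2Plus1D(channels) for _ in range(num_blocks)])
        self.fuse = nn.Conv3d(channels * (num_blocks + 1), channels,
                              kernel_size=1, bias=False)

    def forward(self, x):
        feats = [x]
        out = x
        for block in self.blocks:
            out = block(out)        # each pass widens the temporal receptive field
            feats.append(out)
        return self.fuse(torch.cat(feats, dim=1))  # one-shot aggregation


if __name__ == "__main__":
    clip = torch.randn(2, 32, 8, 56, 56)  # (batch, channels, frames, H, W)
    print(TOSA(32)(clip).shape)           # torch.Size([2, 32, 8, 56, 56])
```

Because each stacked block enlarges the temporal receptive field by one step, concatenating every intermediate output lets the fused feature mix short-range and long-range temporal context in a single aggregation, which is the intuition behind stacking T-OSA without external temporal modules.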
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z) - Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles [0.0]
This work presents an innovative approach to validate video content.
The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks.
Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfake generation.
arXiv Detail & Related papers (2023-10-25T06:00:37Z) - Spatiotemporal Modeling Encounters 3D Medical Image Analysis:
Slice-Shift UNet with Multi-View Fusion [0.0]
We propose a new 2D-based model dubbed Slice SHift UNet, which encodes three-dimensional features at the computational complexity of a 2D CNN.
More precisely multi-view features are collaboratively learned by performing 2D convolutions along the three planes of a volume.
The effectiveness of our approach is validated on the Multi-Modality Abdominal Multi-Organ Segmentation (AMOS) and Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) datasets.
arXiv Detail & Related papers (2023-07-24T14:53:23Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted the attention of the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z) - Efficient Spatialtemporal Context Modeling for Action Recognition [42.30158166919919]
We propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range contextual information in video for action recognition.
We model the relationship between points on the same line along the horizontal, vertical, and depth directions at each time step, which forms a 3D criss-cross structure.
Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 11%, respectively, for video context modeling.
arXiv Detail & Related papers (2021-03-20T14:48:12Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block that is capable of extracting temporal features at multiple resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid network (STH) which simultaneously encodes spatial and temporal video information at a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model size.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)