Approximated Bilinear Modules for Temporal Modeling
- URL: http://arxiv.org/abs/2007.12887v1
- Date: Sat, 25 Jul 2020 09:07:35 GMT
- Title: Approximated Bilinear Modules for Temporal Modeling
- Authors: Xinqi Zhu and Chang Xu and Langwen Hui and Cewu Lu and Dacheng Tao
- Abstract summary: Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary branch.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without Kinetics pretraining.
- Score: 116.6506871576514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider two less-emphasized temporal properties of video: 1. Temporal
cues are fine-grained; 2. Temporal modeling needs reasoning. To tackle both
problems at once, we exploit approximated bilinear modules (ABMs) for temporal
modeling. Two main points make the modules effective: two-layer MLPs can be seen
as a constrained approximation of bilinear operations, and thus can
be used to construct deep ABMs in existing CNNs while reusing pretrained
parameters; frame features can be divided into static and dynamic parts because
of visual repetition in adjacent frames, which enables temporal modeling to be
more efficient. Multiple ABM variants and implementations are investigated,
from high performance to high efficiency. Specifically, we show how two-layer
subnets in CNNs can be converted to temporal bilinear modules by adding an
auxiliary branch. In addition, we introduce snippet sampling and shifting inference
to boost sparse-frame video classification performance. Extensive ablation
studies are conducted to show the effectiveness of proposed techniques. Our
models can outperform most state-of-the-art methods on Something-Something v1
and v2 datasets without Kinetics pretraining, and are also competitive on other
YouTube-like action recognition datasets. Our code is available at
https://github.com/zhuxinqimac/abm-pytorch.
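To make the abstract's central construction concrete, the snippet below is a minimal, hedged sketch of how a pretrained two-layer subnet plus one added auxiliary branch can approximate a bilinear interaction between adjacent frame features. All names, shapes, and the element-wise fusion shown here are illustrative assumptions rather than the authors' implementation (the linked repository is the reference); the static/dynamic channel split and snippet sampling described in the abstract are omitted for brevity.

```python
# Hedged sketch of an approximated temporal bilinear module.
# Names (fc1, fc2, fc_aux) and shapes are assumptions for illustration only;
# see https://github.com/zhuxinqimac/abm-pytorch for the reference code.
import torch
import torch.nn as nn


class ApproxTemporalBilinear(nn.Module):
    """Reuse a two-layer MLP as a factorized bilinear module across time.

    out_t ~= fc2( relu(fc1(x_t)) * fc_aux(x_{t+1}) ),
    i.e. a low-rank bilinear-style interaction between adjacent frames.
    """

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)     # pretrained weights could be reused here
        self.fc2 = nn.Linear(hidden, channels)     # ... and here
        self.fc_aux = nn.Linear(channels, hidden)  # the added auxiliary branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) per-frame features
        x_next = torch.roll(x, shifts=-1, dims=1)  # adjacent-frame features (wraps at the last frame)
        main = self.relu(self.fc1(x))              # original two-layer path
        aux = self.fc_aux(x_next)                  # auxiliary temporal path
        return self.fc2(main * aux)                # element-wise product -> bilinear-like term


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256)                 # 2 clips, 8 frames, 256-d features
    abm = ApproxTemporalBilinear(channels=256, hidden=512)
    print(abm(feats).shape)                        # torch.Size([2, 8, 256])
```

The element-wise product between the two branches is what gives the module its bilinear character: each output unit becomes a weighted sum of products of (projected) features from the two frames, which is the low-rank approximation the abstract refers to.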
Related papers
- What Can Simple Arithmetic Operations Do for Temporal Modeling? [100.39047523315662]
Temporal modeling plays a crucial role in understanding video content.
Previous studies have built complicated temporal relations through time sequences, enabled by the development of powerful hardware.
In this work, we explore the potential of four simple arithmetic operations for temporal modeling.
arXiv Detail & Related papers (2023-07-18T00:48:56Z) - Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract the action visual tempo from low-level backbone features within a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - TDN: Temporal Difference Networks for Efficient Action Recognition [31.922001043405924]
This paper presents a new video architecture, termed the Temporal Difference Network (TDN).
The core of our TDN is an efficient temporal difference module (TDM) that explicitly leverages a temporal difference operator (a minimal sketch of this idea appears after this list).
Our TDN presents a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset.
arXiv Detail & Related papers (2020-12-18T06:31:08Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification [12.787763599624173]
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D.
Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
arXiv Detail & Related papers (2020-12-01T07:40:06Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
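As referenced in the TDN entry above, a temporal difference operator simply contrasts the features of adjacent frames and feeds the resulting motion signal back into the representation. The sketch below illustrates that general idea only; it is not the TDN authors' module, and all names, layers, and shapes are assumptions.

```python
# Hedged sketch of a generic temporal-difference block (not the TDN implementation).
import torch
import torch.nn as nn


class TemporalDifference(nn.Module):
    """Enhance per-frame features with differences to the next frame."""

    def __init__(self, channels: int):
        super().__init__()
        # lightweight temporal transform of the difference signal (assumed design choice)
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) per-frame features
        diff = torch.roll(x, shifts=-1, dims=1) - x           # x_{t+1} - x_t (wraps at the boundary)
        diff = self.conv(diff.transpose(1, 2)).transpose(1, 2)  # conv over the time axis
        return x + diff                                        # residual fusion of motion cues


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64)
    print(TemporalDifference(64)(feats).shape)                 # torch.Size([2, 8, 64])
```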