Approximated Bilinear Modules for Temporal Modeling
- URL: http://arxiv.org/abs/2007.12887v1
- Date: Sat, 25 Jul 2020 09:07:35 GMT
- Title: Approximated Bilinear Modules for Temporal Modeling
- Authors: Xinqi Zhu and Chang Xu and Langwen Hui and Cewu Lu and Dacheng Tao
- Abstract summary: Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary branch.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without Kinetics pretraining.
- Score: 116.6506871576514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider two less-emphasized temporal properties of video: 1. Temporal
cues are fine-grained; 2. Temporal modeling needs reasoning. To tackle both
problems at once, we exploit approximated bilinear modules (ABMs) for temporal
modeling. Two main points make the modules effective: two-layer MLPs can be seen
as a constrained approximation of bilinear operations, and thus can
be used to construct deep ABMs in existing CNNs while reusing pretrained
parameters; frame features can be divided into static and dynamic parts because
of visual repetition in adjacent frames, which enables temporal modeling to be
more efficient. Multiple ABM variants and implementations are investigated,
from high performance to high efficiency. Specifically, we show how two-layer
subnets in CNNs can be converted to temporal bilinear modules by adding an
auxiliary branch. In addition, we introduce snippet sampling and shifting inference
to boost sparse-frame video classification performance. Extensive ablation
studies are conducted to show the effectiveness of proposed techniques. Our
models can outperform most state-of-the-art methods on Something-Something v1
and v2 datasets without Kinetics pretraining, and are also competitive on other
YouTube-like action recognition datasets. Our code is available at
https://github.com/zhuxinqimac/abm-pytorch.
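To make the abstract's central construction concrete, the snippet below is a minimal, hedged sketch of how a pretrained two-layer subnet plus one added auxiliary branch can approximate a bilinear interaction between adjacent frame features. All names, shapes, and the element-wise fusion shown here are illustrative assumptions rather than the authors' implementation (the linked repository is the reference); the static/dynamic channel split and snippet sampling described in the abstract are omitted for brevity.

```python
# Hedged sketch of an approximated temporal bilinear module.
# Names (fc1, fc2, fc_aux) and shapes are assumptions for illustration only;
# see https://github.com/zhuxinqimac/abm-pytorch for the reference code.
import torch
import torch.nn as nn


class ApproxTemporalBilinear(nn.Module):
    """Reuse a two-layer MLP as a factorized bilinear module across time.

    out_t ~= fc2( relu(fc1(x_t)) * fc_aux(x_{t+1}) ),
    i.e. a low-rank bilinear-style interaction between adjacent frames.
    """

    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)     # pretrained weights could be reused here
        self.fc2 = nn.Linear(hidden, channels)     # ... and here
        self.fc_aux = nn.Linear(channels, hidden)  # the added auxiliary branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) per-frame features
        x_next = torch.roll(x, shifts=-1, dims=1)  # adjacent-frame features (wraps at the last frame)
        main = self.relu(self.fc1(x))              # original two-layer path
        aux = self.fc_aux(x_next)                  # auxiliary temporal path
        return self.fc2(main * aux)                # element-wise product -> bilinear-like term


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256)                 # 2 clips, 8 frames, 256-d features
    abm = ApproxTemporalBilinear(channels=256, hidden=512)
    print(abm(feats).shape)                        # torch.Size([2, 8, 256])
```

The element-wise product between the two branches is what gives the module its bilinear character: each output unit becomes a weighted sum of products of (projected) features from the two frames, which is the low-rank approximation the abstract refers to.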
Related papers
- What Can Simple Arithmetic Operations Do for Temporal Modeling? [100.39047523315662]
Temporal modeling plays a crucial role in understanding video content.
Previous studies have built complicated temporal relations through time sequences, enabled by the development of powerful hardware.
In this work, we explore the potential of four simple arithmetic operations for temporal modeling.
arXiv Detail & Related papers (2023-07-18T00:48:56Z) - Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract the action visual tempo from low-level backbone features within a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - TDN: Temporal Difference Networks for Efficient Action Recognition [31.922001043405924]
This paper presents a new video architecture, termed the Temporal Difference Network (TDN).
The core of our TDN is an efficient temporal difference module (TDM) that explicitly leverages a temporal difference operator (a minimal sketch of this idea appears after this list).
Our TDN presents a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset.
arXiv Detail & Related papers (2020-12-18T06:31:08Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification [12.787763599624173]
We propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D.
Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation, surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics.
arXiv Detail & Related papers (2020-12-01T07:40:06Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
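As referenced in the TDN entry above, a temporal difference operator simply contrasts the features of adjacent frames and feeds the resulting motion signal back into the representation. The sketch below illustrates that general idea only; it is not the TDN authors' module, and all names, layers, and shapes are assumptions.

```python
# Hedged sketch of a generic temporal-difference block (not the TDN implementation).
import torch
import torch.nn as nn


class TemporalDifference(nn.Module):
    """Enhance per-frame features with differences to the next frame."""

    def __init__(self, channels: int):
        super().__init__()
        # lightweight temporal transform of the difference signal (assumed design choice)
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) per-frame features
        diff = torch.roll(x, shifts=-1, dims=1) - x           # x_{t+1} - x_t (wraps at the boundary)
        diff = self.conv(diff.transpose(1, 2)).transpose(1, 2)  # conv over the time axis
        return x + diff                                        # residual fusion of motion cues


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64)
    print(TemporalDifference(64)(feats).shape)                 # torch.Size([2, 8, 64])
```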