MVFNet: Multi-View Fusion Network for Efficient Video Recognition
- URL: http://arxiv.org/abs/2012.06977v2
- Date: Tue, 5 Jan 2021 06:09:48 GMT
- Title: MVFNet: Multi-View Fusion Network for Efficient Video Recognition
- Authors: Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding
- Abstract summary: We introduce a multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
- Score: 79.92736306354576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventionally, spatiotemporal modeling and its computational complexity are the
two most studied topics in video action recognition. Existing state-of-the-art methods
achieve excellent accuracy irrespective of complexity, while efficient spatiotemporal
modeling solutions remain slightly inferior in performance. In this paper, we attempt to
achieve both efficiency and effectiveness simultaneously. First, besides traditionally
treating H x W x T video frames as a space-time signal (viewed from the Height-Width
spatial plane), we propose to also model video from the other two planes, Height-Time
and Width-Time, to capture video dynamics thoroughly. Second, our model is built on
2D CNN backbones, with model complexity carefully controlled by design. Specifically,
we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using
separable convolution for efficiency. It is a plug-and-play module and can be inserted
into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover,
MVFNet can be viewed as a generalized video modeling framework that specializes into
existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive
experiments on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101,
and HMDB-51) show its superiority. The proposed MVFNet achieves state-of-the-art
performance with the complexity of a 2D CNN.
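To make the multi-view idea above concrete, here is a minimal PyTorch sketch of a plug-and-play MVF-style block, assuming a channel-split ratio `alpha` and depthwise 3D convolutions over the H-W, H-T, and W-T planes. The class name, split ratio, and fusion details are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MVFModule(nn.Module):
    """Sketch of a multi-view fusion (MVF) style block (illustrative, not the authors' code).

    Input is a 5D tensor (N, C, T, H, W). A fraction `alpha` of the channels is
    processed by three depthwise (channel-wise) convolutions, one per view plane:
    H-W (1x3x3), H-T (3x3x1), and W-T (3x1x3). Their outputs are summed and
    concatenated with the untouched channels, so the block can be dropped into a
    2D-CNN stage with little extra cost.
    """
    def __init__(self, channels: int, alpha: float = 0.5):
        super().__init__()
        self.c_mvf = int(channels * alpha)  # channels that see the multi-view convs
        c = self.c_mvf
        # Depthwise 3D convs, one per viewing plane (groups=c keeps them separable).
        self.conv_hw = nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1), groups=c, bias=False)
        self.conv_ht = nn.Conv3d(c, c, kernel_size=(3, 3, 1), padding=(1, 1, 0), groups=c, bias=False)
        self.conv_wt = nn.Conv3d(c, c, kernel_size=(3, 1, 3), padding=(1, 0, 1), groups=c, bias=False)
        self.bn = nn.BatchNorm3d(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_mvf, x_rest = x[:, :self.c_mvf], x[:, self.c_mvf:]
        fused = self.conv_hw(x_mvf) + self.conv_ht(x_mvf) + self.conv_wt(x_mvf)
        fused = self.bn(fused) + x_mvf  # residual fusion on the multi-view channels
        return torch.cat([fused, x_rest], dim=1)

# Example: an 8-frame clip at 56x56 with 64 channels keeps its shape.
clip = torch.randn(2, 64, 8, 56, 56)
print(MVFModule(64)(clip).shape)  # torch.Size([2, 64, 8, 56, 56])
```

Because the three convolutions are depthwise and only cover part of the channels, the block adds little over a plain 2D backbone, which is the efficiency argument the abstract makes.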
Related papers
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Searching for Two-Stream Models in Multivariate Space for Video
Recognition [80.25356538056839]
We present a pragmatic neural architecture search approach, which is able to search for two-stream video models in giant spaces efficiently.
We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space.
arXiv Detail & Related papers (2021-08-30T02:03:28Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels that adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally demanding EPIC-KITCHENS-100 and Something-Something V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted into temporal bilinear modules by adding an auxiliary-branch sampling.
Our models outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map (a simplified sketch of this idea appears after this list).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that TAM consistently outperforms other temporal modeling methods.
arXiv Detail & Related papers (2020-05-14T08:22:45Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model size.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)
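Several of the papers above, notably TAM, generate input-dependent temporal kernels from the feature map itself. Below is a simplified, illustrative sketch of that general idea, assuming a fixed number of frames and a small MLP kernel generator; the class name, `reduction` parameter, and overall structure are assumptions, not the TAM authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalConv(nn.Module):
    """Minimal sketch of a video-specific temporal kernel (in the spirit of TAM).

    A spatially pooled descriptor (N, C, T) is mapped by a small MLP to one
    K-tap kernel per channel and per video, which is then applied as a depthwise
    convolution along the time axis. Illustrative only; assumes a fixed T.
    """
    def __init__(self, channels: int, num_frames: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        self.k = kernel_size
        hidden = max(num_frames // reduction, 1)
        self.gen = nn.Sequential(
            nn.Linear(num_frames, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, kernel_size),
            nn.Softmax(dim=-1),  # normalized K-tap temporal kernel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        desc = x.mean(dim=(3, 4))                                  # (N, C, T): spatially pooled descriptor
        kernels = self.gen(desc).reshape(n * c, 1, self.k, 1)      # one kernel per (video, channel)
        x_r = x.reshape(1, n * c, t, h * w)
        out = F.conv2d(x_r, kernels, padding=(self.k // 2, 0), groups=n * c)
        return out.reshape(n, c, t, h, w)

clip = torch.randn(2, 64, 8, 14, 14)
print(AdaptiveTemporalConv(64, num_frames=8)(clip).shape)  # torch.Size([2, 64, 8, 14, 14])
```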