DSANet: Dynamic Segment Aggregation Network for Video-Level
Representation Learning
- URL: http://arxiv.org/abs/2105.12085v1
- Date: Tue, 25 May 2021 17:09:57 GMT
- Title: DSANet: Dynamic Segment Aggregation Network for Video-Level
Representation Learning
- Authors: Wenhao Wu, Yuxiang Zhao, Yanwu Xu, Xiao Tan, Dongliang He, Zhikang
Zou, Jin Ye, Yingying Li, Mingde Yao, Zichao Dong, Yifeng Shi
- Abstract summary: Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition.
In this paper, we introduce a Dynamic Segment Aggregation (DSA) module to capture relationships among snippets.
Our proposed DSA module is shown to benefit various video recognition models significantly.
- Score: 29.182482776910152
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-range and short-range temporal modeling are two complementary and
crucial aspects of video recognition. Most of the state-of-the-arts focus on
short-range spatio-temporal modeling and then average multiple snippet-level
predictions to yield the final video-level prediction. Thus, their video-level
prediction does not consider spatio-temporal features of how video evolves
along the temporal dimension. In this paper, we introduce a novel Dynamic
Segment Aggregation (DSA) module to capture relationships among snippets. To be
more specific, we attempt to generate a dynamic kernel for a convolutional
operation to aggregate long-range temporal information among adjacent snippets
adaptively. The DSA module is an efficient plug-and-play module and can be
combined with the off-the-shelf clip-based models (i.e., TSM, I3D) to perform
powerful long-range modeling with minimal overhead. The final video
architecture is coined DSANet. We conduct extensive experiments on several
video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400,
Something-Something V1 and ActivityNet) to show its superiority. Our proposed
DSA module is shown to benefit various video recognition models significantly.
For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is
improved from 74.9% to 78.2% on Kinetics-400. Codes will be available.
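The abstract describes DSA as generating a dynamic kernel for a convolution that adaptively aggregates long-range information across adjacent snippets. The PyTorch-style sketch below illustrates that idea only; it is not the authors' released code, and the tensor layout, kernel size, and kernel-generation branch are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSegmentAggregation(nn.Module):
    """Sketch: aggregate snippet-level features with an input-conditioned temporal kernel."""

    def __init__(self, num_snippets, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Small branch that predicts a per-channel temporal kernel from the snippet features.
        self.kernel_gen = nn.Sequential(
            nn.Linear(num_snippets, num_snippets),
            nn.ReLU(inplace=True),
            nn.Linear(num_snippets, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (N, C, U) -- per-snippet features for U snippets of each video.
        n, c, u = x.shape
        # Generate a dynamic kernel for each sample and channel: (N, C, K).
        kernels = self.kernel_gen(x)
        # Apply it as a depthwise 1D convolution over the snippet axis, folding the
        # batch into groups so every (sample, channel) pair uses its own kernel.
        x = x.reshape(1, n * c, u)
        w = kernels.reshape(n * c, 1, self.kernel_size)
        out = F.conv1d(x, w, padding=self.kernel_size // 2, groups=n * c)
        return out.reshape(n, c, u)


# Hypothetical usage with pooled snippet features from a clip-based backbone (e.g. TSM or I3D):
dsa = DynamicSegmentAggregation(num_snippets=8)
feats = torch.randn(4, 2048, 8)   # batch of 4 videos, 2048-d features, 8 snippets each
video_feats = dsa(feats)          # (4, 2048, 8); average over snippets for the video-level prediction
```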
Related papers
- No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding [38.60950616529459]
We propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding (a minimal sketch of this reshaping appears after this list).
The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding.
arXiv Detail & Related papers (2024-05-14T06:32:40Z) - Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action
Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
arXiv Detail & Related papers (2023-07-13T17:59:33Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatio-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two layers in CNNs can be converted into temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
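The SqueezeTime entry above summarizes the idea of folding the time axis into the channel dimension so that a plain 2D network can process the clip. Below is a minimal sketch of that reshaping alone, with all shapes assumed for illustration; it is not the SqueezeTime implementation.

```python
import torch

def squeeze_time_into_channels(clip: torch.Tensor) -> torch.Tensor:
    """Fold the temporal axis into channels: (N, C, T, H, W) -> (N, C*T, H, W).

    After this reshape, a 2D backbone can mix temporal information through its
    channel dimension, which is the core idea the SqueezeTime summary describes.
    """
    n, c, t, h, w = clip.shape
    return clip.reshape(n, c * t, h, w)


clip = torch.randn(2, 3, 16, 224, 224)   # two 16-frame RGB clips
x = squeeze_time_into_channels(clip)     # (2, 48, 224, 224), ready for a 2D CNN
```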