An end-to-end multi-scale network for action prediction in videos
- URL: http://arxiv.org/abs/2301.01216v1
- Date: Sat, 31 Dec 2022 06:58:41 GMT
- Title: An end-to-end multi-scale network for action prediction in videos
- Authors: Xiaofa Liu, Jianqin Yin, Yuan Sun, Zhicheng Zhang, Jin Tang
- Abstract summary: We develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner.
Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101.
- Score: 31.967024536359908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we develop an efficient multi-scale network to predict action
classes in partial videos in an end-to-end manner. Unlike most existing methods
with offline feature generation, our method directly takes frames as input and
further models motion evolution on two different temporal scales. This addresses
both the complexity of two-stage modeling and the insufficient temporal and
spatial information of a single scale. Our proposed
End-to-End MultiScale Network (E2EMSNet) is composed of two scales, named the
segment scale and the observed global scale. The segment scale feeds temporal
differences over consecutive frames into 2D convolutions to capture finer motion
patterns. At the observed global scale, a Long Short-Term Memory (LSTM) network
captures motion features over all observed frames. Our model
provides a simple and efficient modeling framework with a small computational
cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and
UCF101. The extensive experiments demonstrate the effectiveness of our method
for action prediction in videos.
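The two-scale design described above can be summarized in a short sketch. The following is a minimal, hedged illustration (assuming PyTorch; the module names SegmentScale and E2EMSNetSketch, the layer widths, and the classifier head are illustrative assumptions, not the authors' released code): the segment scale applies 2D convolutions to temporal differences of consecutive frames, and the observed global scale aggregates those segment features with an LSTM.

```python
import torch
import torch.nn as nn

class SegmentScale(nn.Module):
    """Segment scale sketch: temporal differences over consecutive frames
    are fed to 2D convolutions to capture finer motion patterns.
    Kernel sizes and channel widths are illustrative assumptions."""
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, frames):
        # frames: (B, T, C, H, W); frame differences encode short-range motion
        diffs = frames[:, 1:] - frames[:, :-1]            # (B, T-1, C, H, W)
        b, t, c, h, w = diffs.shape
        feats = self.conv(diffs.reshape(b * t, c, h, w))  # (B*(T-1), D, 1, 1)
        return feats.reshape(b, t, -1)                    # (B, T-1, D)

class E2EMSNetSketch(nn.Module):
    """Observed global scale sketch: an LSTM aggregates segment features
    over the observed (partial) video before classification."""
    def __init__(self, feat_dim=256, hidden=512, num_classes=8):
        super().__init__()
        self.segment_scale = SegmentScale(feat_dim=feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames):
        seg_feats = self.segment_scale(frames)   # per-segment motion features
        _, (h_n, _) = self.lstm(seg_feats)       # global evolution of observed frames
        return self.classifier(h_n[-1])          # action logits for the partial video

# Usage: predict from a partially observed clip of 16 RGB frames
model = E2EMSNetSketch(num_classes=8)
logits = model(torch.randn(2, 16, 3, 112, 112))  # (2, 8)
```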
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Unlocking the Secrets of Linear Complexity Sequence Model from A Unified Perspective [26.479602180023125]
The Linear Complexity Sequence Model (LCSM) unites various sequence modeling techniques with linear complexity.
We segment the modeling processes of these models into three distinct stages: Expand, Oscillation, and Shrink.
We perform experiments to analyze the impact of different stage settings on language modeling and retrieval tasks.
arXiv Detail & Related papers (2024-05-27T17:38:55Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at least at two temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions (a sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-06-02T11:43:49Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling.
Our models can outperform most state-of-the-art methods on the Something-Something V1 and V2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
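Referring back to the TSI entry above, the following is a minimal sketch (assuming PyTorch; the module name MultiScaleTemporalConv and the kernel sizes are illustrative assumptions, not the paper's released code) of multi-scale temporal modeling with a group of separate 1D convolutions applied along the time axis:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Sketch of cross-scale temporal modeling: parallel depthwise 1D convs
    with different kernel sizes over the time axis, summed into one output."""
    def __init__(self, channels=256, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (B, C, T) per-frame features; each branch models a different temporal scale
        return sum(branch(x) for branch in self.branches)

feats = torch.randn(2, 256, 16)        # 16 frames of 256-dim features
out = MultiScaleTemporalConv()(feats)  # (2, 256, 16)
```

Each branch covers a different temporal extent; summing the branches is one simple way to integrate the scales, whereas the paper additionally integrates them with attention.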