Dynamic Inference: A New Approach Toward Efficient Video Action
Recognition
- URL: http://arxiv.org/abs/2002.03342v1
- Date: Sun, 9 Feb 2020 11:09:56 GMT
- Title: Dynamic Inference: A New Approach Toward Efficient Video Action
Recognition
- Authors: Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Yi Yang, Shilei Wen
- Abstract summary: Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
- Score: 69.9658249941149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though action recognition in videos has achieved great success recently, it
remains a challenging task due to the massive computational cost. Designing
lightweight networks is a possible solution, but it may degrade the recognition
performance. In this paper, we propose a general dynamic inference
idea to improve inference efficiency by leveraging the variation in the
distinguishability of different videos. The dynamic inference approach can be
achieved from aspects of the network depth and the number of input video
frames, or even in a joint input-wise and network depth-wise manner. In a
nutshell, we treat input frames and network depth of the computational graph as
a 2-dimensional grid, and several checkpoints, each with a prediction module,
are placed on this grid in advance. The inference is carried out progressively
on the grid by following some predefined route; whenever the inference process
comes across a checkpoint, an early prediction can be made depending on whether
the early stop criterion is met. As a proof of concept, we instantiate
three dynamic inference frameworks using two well-known backbone CNNs. In these
instances, we overcome the drawback of limited temporal coverage resulting from
an early prediction by a novel frame permutation scheme, and alleviate the
conflict between progressive computation and video temporal relation modeling
by introducing an online temporal shift module. Extensive experiments are
conducted to thoroughly analyze the effectiveness of our ideas and to inspire
future research efforts. Results on various datasets also demonstrate the
superiority of our approach.
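The checkpoint-and-route idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the names `dynamic_inference`, `coarse_to_fine_order`, and `predict_at` are hypothetical, the early stop criterion is assumed to be a simple confidence threshold, and the frame ordering is one plausible way to keep temporal coverage when stopping early, not necessarily the permutation scheme used in the paper.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A grid cell: (number of frames consumed, network depth reached).
Cell = Tuple[int, int]


def coarse_to_fine_order(num_frames: int) -> List[int]:
    """Hypothetical frame ordering: any prefix is spread across the whole video."""
    stride = 1
    while stride < num_frames:
        stride *= 2
    order: List[int] = []
    seen: Set[int] = set()
    while stride >= 1:
        # Visit frames at the current stride, skipping ones already emitted.
        for i in range(0, num_frames, stride):
            if i not in seen:
                seen.add(i)
                order.append(i)
        stride //= 2
    return order


def dynamic_inference(
    route: Sequence[Cell],
    checkpoints: Set[Cell],
    predict_at: Callable[[Cell], List[float]],
    threshold: float,
) -> Tuple[int, Cell]:
    """Follow the route; at each checkpoint, stop early if confident enough.

    Assumed stop criterion: the top class probability reaches `threshold`.
    If no checkpoint is confident, the last checkpoint's prediction is used.
    """
    last = None
    for cell in route:
        if cell not in checkpoints:
            continue  # no prediction module here, keep computing
        probs = predict_at(cell)
        label = max(range(len(probs)), key=lambda i: probs[i])
        last = (label, cell)
        if probs[label] >= threshold:
            return last  # early exit: remaining cells are never computed
    if last is None:
        raise ValueError("route contains no checkpoint")
    return last


# Toy stand-in for the prediction modules (made-up probabilities).
toy_probs = {(2, 1): [0.6, 0.4], (4, 2): [0.1, 0.9]}
route = [(1, 1), (2, 1), (2, 2), (4, 2)]
checkpoints = {(2, 1), (4, 2)}

print(coarse_to_fine_order(8))  # [0, 4, 2, 6, 1, 3, 5, 7]
print(dynamic_inference(route, checkpoints, toy_probs.__getitem__, 0.5))  # (0, (2, 1))
print(dynamic_inference(route, checkpoints, toy_probs.__getitem__, 0.8))  # (1, (4, 2))
```

With the lower threshold the toy video exits at the first checkpoint after two frames and one depth stage; with the higher threshold it continues to the second checkpoint. This mirrors the abstract's point that easily distinguishable videos can stop early while harder ones use more computation.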
Related papers
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using a semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
- Modeling Temporal Concept Receptive Field Dynamically for Untrimmed
Video Analysis [105.06166692486674]
We study the temporal concept receptive field of concept-based event representation.
We introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics.
Different coefficients can generate an appropriate and accurate temporal concept receptive field size according to the input videos.
arXiv Detail & Related papers (2021-11-23T04:59:48Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame of the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Dynamic Network Quantization for Efficient Video Inference [60.109250720206425]
We propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition.
We train both networks effectively using standard backpropagation with a loss to achieve both competitive performance and resource efficiency.
arXiv Detail & Related papers (2021-08-23T20:23:57Z)
- CDN-MEDAL: Two-stage Density and Difference Approximation Framework for
Motion Analysis [3.337126420148156]
We propose a novel, two-stage method of change detection with two convolutional neural networks.
Our two-stage framework contains approximately 3.5K parameters in total but still maintains rapid convergence to intricate motion patterns.
arXiv Detail & Related papers (2021-06-07T16:39:42Z)
- TrackMPNN: A Message Passing Graph Neural Architecture for Multi-Object
Tracking [8.791710193028903]
This study follows many previous approaches to multi-object tracking (MOT) that model the problem using graph-based data structures.
We create a framework based on dynamic undirected graphs that represent the data association problem over multiple timesteps.
We also provide solutions and propositions for the computational problems that need to be addressed to create a memory-efficient, real-time, online algorithm.
arXiv Detail & Related papers (2021-01-11T21:52:25Z)
- Improving Video Instance Segmentation by Light-weight Temporal
Uncertainty Estimates [11.580916951856256]
We present a time-dynamic approach to model uncertainties of instance segmentation networks.
We apply this approach to the detection of false positives and the estimation of prediction quality.
The proposed method only requires a readily trained neural network and video sequence input.
arXiv Detail & Related papers (2020-12-14T13:39:05Z)
- A Deep-Unfolded Reference-Based RPCA Network For Video
Foreground-Background Separation [86.35434065681925]
This paper proposes a new deep-unfolding-based network design for the problem of Robust Principal Component Analysis (RPCA).
Unlike existing designs, our approach focuses on modeling the temporal correlation between the sparse representations of consecutive video frames.
Experimentation using the moving MNIST dataset shows that the proposed network outperforms a recently proposed state-of-the-art RPCA network in the task of video foreground-background separation.
arXiv Detail & Related papers (2020-10-02T11:40:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.