FrameExit: Conditional Early Exiting for Efficient Video Recognition
- URL: http://arxiv.org/abs/2104.13400v1
- Date: Tue, 27 Apr 2021 18:01:05 GMT
- Title: FrameExit: Conditional Early Exiting for Efficient Video Recognition
- Authors: Amir Ghodrati, Babak Ehteshami Bejnordi, Amirhossein Habibian
- Abstract summary: We propose a conditional early exiting framework for efficient video recognition.
Our model learns to process fewer frames for simpler videos and more frames for complex ones.
Our method sets a new state of the art for efficient video understanding on the HVU benchmark.
- Score: 11.92976432364216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a conditional early exiting framework for efficient
video recognition. While existing works focus on selecting a subset of salient
frames to reduce the computation costs, we propose to use a simple sampling
strategy combined with conditional early exiting to enable efficient
recognition. Our model automatically learns to process fewer frames for simpler
videos and more frames for complex ones. To achieve this, we employ a cascade
of gating modules to automatically determine the earliest point in processing
where an inference is sufficiently reliable. We generate on-the-fly supervision
signals to the gates to provide a dynamic trade-off between accuracy and
computational cost. Our proposed model outperforms competing methods on three
large-scale video benchmarks. In particular, on ActivityNet1.3 and
mini-kinetics, we outperform the state-of-the-art efficient video recognition
methods with 1.3$\times$ and 2.1$\times$ less GFLOPs, respectively.
Additionally, our method sets a new state of the art for efficient video
understanding on the HVU benchmark.
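To make the gating cascade concrete, the PyTorch-style sketch below illustrates the general idea of conditional early exiting under our own simplifying assumptions (a per-frame backbone, running-mean feature pooling, and a fixed exit threshold at inference). The names `ExitGate` and `EarlyExitVideoClassifier` are hypothetical; this is a minimal illustration, not the paper's exact architecture or training procedure.

```python
import torch
import torch.nn as nn


class ExitGate(nn.Module):
    """Small gate that predicts whether the features aggregated so far
    are reliable enough to stop processing further frames."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 4, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # One exit probability per video in the batch.
        return torch.sigmoid(self.mlp(feats)).squeeze(-1)


class EarlyExitVideoClassifier(nn.Module):
    """Cascade of gates over sequentially sampled frames: classify as soon
    as a gate is confident, otherwise keep accumulating frames."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 max_frames: int, exit_threshold: float = 0.5):
        super().__init__()
        self.backbone = backbone  # per-frame feature extractor (assumed given)
        self.gates = nn.ModuleList([ExitGate(feat_dim) for _ in range(max_frames - 1)])
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.exit_threshold = exit_threshold

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        """frames: (T, C, H, W) for one video, in the sampling order.
        Returns (class logits, number of frames actually processed)."""
        agg = None
        for t, frame in enumerate(frames):
            feat = self.backbone(frame.unsqueeze(0))                    # (1, feat_dim)
            agg = feat if agg is None else (agg * t + feat) / (t + 1)   # running mean
            # Exit as soon as a gate deems the aggregated features reliable enough.
            if t < len(self.gates) and self.gates[t](agg).item() > self.exit_threshold:
                return self.classifier(agg), t + 1
        return self.classifier(agg), frames.shape[0]                    # used all frames


# Example usage (hypothetical backbone producing global-average-pooled features):
# backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
# model = EarlyExitVideoClassifier(backbone, feat_dim=64, num_classes=200, max_frames=10)
# logits, n_used = model(torch.randn(10, 3, 224, 224))
```

In the paper, the gates are trained with on-the-fly supervision signals so that exit decisions trade accuracy against computation; the fixed threshold above only stands in for that learned behavior at inference time.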
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- View while Moving: Efficient Video Recognition in Long-untrimmed Videos [17.560160747282147]
We propose a novel recognition paradigm "View while Moving" for efficient long-untrimmed video recognition.
In contrast to the two-stage paradigm, our paradigm only needs to access the raw frame once.
Our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency.
arXiv Detail & Related papers (2023-08-09T09:46:26Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method for retrieving key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z) - NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.4-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Adaptive Compact Attention For Few-shot Video-to-video Translation [13.535988102579918]
We introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images.
Our core idea is to extract compact basis sets from all the reference images as higher-level representations.
We extensively evaluate our method on a large-scale talking-head video dataset and a human dancing dataset.
arXiv Detail & Related papers (2020-11-30T11:19:12Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2x faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.