FrameExit: Conditional Early Exiting for Efficient Video Recognition
- URL: http://arxiv.org/abs/2104.13400v1
- Date: Tue, 27 Apr 2021 18:01:05 GMT
- Title: FrameExit: Conditional Early Exiting for Efficient Video Recognition
- Authors: Amir Ghodrati, Babak Ehteshami Bejnordi, Amirhossein Habibian
- Abstract summary: We propose a conditional early exiting framework for efficient video recognition.
Our model learns to process fewer frames for simpler videos and more frames for complex ones.
Our method sets a new state of the art for efficient video understanding on the HVU benchmark.
- Score: 11.92976432364216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a conditional early exiting framework for efficient
video recognition. While existing works focus on selecting a subset of salient
frames to reduce the computation costs, we propose to use a simple sampling
strategy combined with conditional early exiting to enable efficient
recognition. Our model automatically learns to process fewer frames for simpler
videos and more frames for complex ones. To achieve this, we employ a cascade
of gating modules to automatically determine the earliest point in processing
where an inference is sufficiently reliable. We generate on-the-fly supervision
signals for the gates to provide a dynamic trade-off between accuracy and
computational cost. Our proposed model outperforms competing methods on three
large-scale video benchmarks. In particular, on ActivityNet1.3 and
mini-kinetics, we outperform the state-of-the-art efficient video recognition
methods with 1.3$\times$ and 2.1$\times$ fewer GFLOPs, respectively.
Additionally, our method sets a new state of the art for efficient video
understanding on the HVU benchmark.
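The cascade of gating modules is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of inference-time conditional early exiting; the class names, the running-average feature aggregation, and the fixed exit threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of conditional early exiting with a cascade of gates.
# Assumes PyTorch; names (ExitGate, EarlyExitVideoNet) and the
# running-average aggregation are illustrative, not the authors' code.
import torch
import torch.nn as nn


class ExitGate(nn.Module):
    """Tiny gate estimating whether the current prediction is reliable."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Probability that processing can stop at this point.
        return torch.sigmoid(self.mlp(feat)).squeeze(-1)


class EarlyExitVideoNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 max_frames: int, threshold: float = 0.5):
        super().__init__()
        self.backbone = backbone  # per-frame feature extractor
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(max_frames)])
        self.gates = nn.ModuleList(
            [ExitGate(feat_dim) for _ in range(max_frames - 1)])
        self.threshold = threshold  # hypothetical fixed exit threshold

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        """frames: (T, C, H, W) for one video, pre-sampled by a simple strategy."""
        agg = None
        for t in range(frames.size(0)):
            feat = self.backbone(frames[t:t + 1])  # (1, feat_dim)
            # Running average aggregates evidence from the frames seen so far.
            agg = feat if agg is None else (agg * t + feat) / (t + 1)
            logits = self.classifiers[t](agg)
            is_last = t == frames.size(0) - 1
            # Exit as soon as a gate deems the prediction reliable enough.
            if is_last or self.gates[t](agg).item() > self.threshold:
                return logits, t + 1  # prediction and number of frames used
```

Note that the paper generates supervision signals for the gates on the fly during training rather than relying on a fixed threshold; the sketch above covers only the inference path.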
Related papers
- Magic 1-For-1: Generating One Minute Video Clips within One Minute [53.07214657235465]
We present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency.
By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics.
arXiv Detail & Related papers (2025-02-11T16:58:15Z)
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
Current video models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters).
Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders.
Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- View while Moving: Efficient Video Recognition in Long-untrimmed Videos [17.560160747282147]
We propose a novel recognition paradigm "View while Moving" for efficient long-untrimmed video recognition.
In contrast to the two-stage paradigm, our paradigm needs to access each raw frame only once.
Our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency.
arXiv Detail & Related papers (2023-08-09T09:46:26Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method for retrieving key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach achieves a training speed of 15 seconds per frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and delivers a significantly faster (2.4-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)