FrameExit: Conditional Early Exiting for Efficient Video Recognition
- URL: http://arxiv.org/abs/2104.13400v1
- Date: Tue, 27 Apr 2021 18:01:05 GMT
- Title: FrameExit: Conditional Early Exiting for Efficient Video Recognition
- Authors: Amir Ghodrati, Babak Ehteshami Bejnordi, Amirhossein Habibian
- Abstract summary: We propose a conditional early exiting framework for efficient video recognition.
Our model learns to process fewer frames for simpler videos and more frames for complex ones.
Our method sets a new state of the art for efficient video understanding on the HVU benchmark.
- Score: 11.92976432364216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a conditional early exiting framework for efficient
video recognition. While existing works focus on selecting a subset of salient
frames to reduce the computation costs, we propose to use a simple sampling
strategy combined with conditional early exiting to enable efficient
recognition. Our model automatically learns to process fewer frames for simpler
videos and more frames for complex ones. To achieve this, we employ a cascade
of gating modules to automatically determine the earliest point in processing
where an inference is sufficiently reliable. We generate on-the-fly supervision
signals for the gates to provide a dynamic trade-off between accuracy and
computational cost. Our proposed model outperforms competing methods on three
large-scale video benchmarks. In particular, on ActivityNet1.3 and
mini-kinetics, we outperform the state-of-the-art efficient video recognition
methods with 1.3$\times$ and 2.1$\times$ fewer GFLOPs, respectively.
Additionally, our method sets a new state of the art for efficient video
understanding on the HVU benchmark.
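The cascade of gating modules is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of inference-time conditional early exiting; the class names, the running-average feature aggregation, and the fixed exit threshold are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of conditional early exiting with a cascade of gates.
# Assumes PyTorch; names (ExitGate, EarlyExitVideoNet) and the
# running-average aggregation are illustrative, not the authors' code.
import torch
import torch.nn as nn


class ExitGate(nn.Module):
    """Tiny gate estimating whether the current prediction is reliable."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Probability that processing can stop at this point.
        return torch.sigmoid(self.mlp(feat)).squeeze(-1)


class EarlyExitVideoNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 max_frames: int, threshold: float = 0.5):
        super().__init__()
        self.backbone = backbone  # per-frame feature extractor
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(max_frames)])
        self.gates = nn.ModuleList(
            [ExitGate(feat_dim) for _ in range(max_frames - 1)])
        self.threshold = threshold  # hypothetical fixed exit threshold

    @torch.no_grad()
    def forward(self, frames: torch.Tensor):
        """frames: (T, C, H, W) for one video, pre-sampled by a simple strategy."""
        agg = None
        for t in range(frames.size(0)):
            feat = self.backbone(frames[t:t + 1])  # (1, feat_dim)
            # Running average aggregates evidence from the frames seen so far.
            agg = feat if agg is None else (agg * t + feat) / (t + 1)
            logits = self.classifiers[t](agg)
            is_last = t == frames.size(0) - 1
            # Exit as soon as a gate deems the prediction reliable enough.
            if is_last or self.gates[t](agg).item() > self.threshold:
                return logits, t + 1  # prediction and number of frames used
```

Note that the paper generates supervision signals for the gates on the fly during training rather than relying on a fixed threshold; the sketch above covers only the inference path.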
Related papers
- Magic 1-For-1: Generating One Minute Video Clips within One Minute [53.07214657235465]
We present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency.
By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics.
arXiv Detail & Related papers (2025-02-11T16:58:15Z)
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
Current video models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters).
Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders.
Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- View while Moving: Efficient Video Recognition in Long-untrimmed Videos [17.560160747282147]
We propose a novel recognition paradigm "View while Moving" for efficient long-untrimmed video recognition.
In contrast to the two-stage paradigm, our paradigm needs to access each raw frame only once.
Our approach outperforms state-of-the-art methods in terms of accuracy as well as efficiency.
arXiv Detail & Related papers (2023-08-09T09:46:26Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method for retrieving key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach achieves a training speed of 15 seconds per frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and delivers a significantly faster (2.4-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that the efficient video recognition task lies in processing a whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)