OCSampler: Compressing Videos to One Clip with Single-step Sampling
- URL: http://arxiv.org/abs/2201.04388v1
- Date: Wed, 12 Jan 2022 09:50:38 GMT
- Title: OCSampler: Compressing Videos to One Clip with Single-step Sampling
- Authors: Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, Limin Wang
- Abstract summary: We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition hinges on processing the whole sequence at once rather than selecting frames sequentially.
- Score: 82.0417131211353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a framework named OCSampler to explore a compact
yet effective video representation with one short clip for efficient video
recognition. Recent works prefer to formulate frame sampling as a sequential
decision task, selecting frames one by one according to their importance,
whereas we present a new paradigm of learning instance-specific video
condensation policies that select informative frames to represent the entire
video in a single step. Our basic motivation is that efficient video
recognition hinges on processing the whole sequence at once rather than
selecting frames sequentially. Accordingly, these policies are derived in a
single step from a lightweight skim network together with a simple yet
effective policy network. Moreover, we extend the proposed method with a frame
number budget, enabling the framework to produce correct predictions with high
confidence using as few frames as possible. Experiments on four benchmarks,
i.e., ActivityNet, Mini-Kinetics, FCVID, and Mini-Sports1M, demonstrate the
effectiveness of OCSampler over previous methods in terms of accuracy,
theoretical computational expense, and actual inference speed. We also
evaluate its generalization power across different classifiers, sampled
frames, and search spaces. In particular, we achieve 76.9% mAP at 21.7 GFLOPs
on ActivityNet with an impressive throughput of 123.9 videos/s on a single
TITAN Xp GPU.
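To make the single-step idea concrete, here is a minimal PyTorch-style sketch, assuming placeholder backbones, hypothetical module names, and a plain hard top-k at inference; it is not the paper's implementation, and the training procedure that learns the condensation policy (and enforces the frame-number budget) is omitted.

```python
import torch
import torch.nn as nn

class OneClipSampler(nn.Module):
    """Toy single-step frame sampler: a cheap skim network scores all frames
    at once, a policy head picks k of them, and only those frames are passed
    to the (expensive) classifier.  Module sizes and the use of plain top-k
    are illustrative assumptions, not the paper's exact design."""

    def __init__(self, num_classes: int, k: int = 8, feat_dim: int = 128):
        super().__init__()
        self.k = k
        # Lightweight skim network applied to low-resolution frames.
        self.skim = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Policy head: one forward pass yields a score per frame.
        self.policy = nn.Linear(feat_dim, 1)
        # Heavy classifier (placeholder) run only on the selected clip.
        self.classifier = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, video_lr: torch.Tensor, video_hr: torch.Tensor):
        # video_lr: (B, T, 3, h, w) low-res frames for skimming
        # video_hr: (B, T, 3, H, W) full-res frames for recognition
        B, T = video_lr.shape[:2]
        feats = self.skim(video_lr.flatten(0, 1)).view(B, T, -1)
        scores = self.policy(feats).squeeze(-1)            # (B, T)
        idx = scores.topk(self.k, dim=1).indices           # one short clip per video
        clip = torch.gather(
            video_hr, 1,
            idx[:, :, None, None, None].expand(-1, -1, *video_hr.shape[2:]))
        logits = self.classifier(clip.flatten(0, 1)).view(B, self.k, -1)
        return logits.mean(dim=1), idx                      # clip-level prediction
```

Because a hard top-k is not differentiable, training such a policy in practice requires a relaxation or a reward-based objective that trades accuracy against the number of frames used; the sketch only illustrates the inference path.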
Related papers
- An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval [1.6581184950812533]
We investigate the trade-offs in frame sampling methods for Video & Frame Retrieval using natural language questions.
Our study focuses on the storage and retrieval of image data (video frames) within the vector database required by the Video RAG pattern.
arXiv Detail & Related papers (2024-07-22T11:44:08Z)
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
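As a rough illustration of the search half of this paradigm, the sketch below greedily grows a k-frame combination by querying a frozen classifier; the scoring interface and the greedy strategy are assumptions, and the "map" stage that learns to predict such combinations from video features is omitted.

```python
from typing import Callable, List, Sequence

def greedy_frame_search(frames: Sequence, k: int,
                        score_fn: Callable[[List], float]) -> List[int]:
    """Greedily grow a set of k frame indices that maximizes the classifier's
    confidence on the ground-truth class.  `score_fn(selected_frames)` is a
    stand-in for "run the recognition model on this frame combination and
    return the probability of the correct label"; it is an assumed interface,
    not an API from the paper."""
    selected: List[int] = []
    remaining = list(range(len(frames)))
    while len(selected) < k and remaining:
        best_idx, best_score = None, float("-inf")
        for i in remaining:
            candidate = sorted(selected + [i])
            score = score_fn([frames[j] for j in candidate])
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return sorted(selected)
```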
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- PMI Sampler: Patch Similarity Guided Frame Selection for Aerial Action Recognition [52.78234467516168]
We introduce the concept of patch mutual information (PMI) score to quantify the motion bias between adjacent frames.
We present an adaptive frame selection strategy using a shifted leaky ReLU and a cumulative distribution function.
Our method achieves relative improvements in top-1 accuracy of 2.2-13.8% on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48.
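A toy reading of this strategy, using (1 - mean patch cosine similarity) between adjacent frames as a stand-in for the PMI score; the shift/slope values and the quantile-based sampling below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def patch_similarity_score(prev: np.ndarray, cur: np.ndarray, patch: int = 16) -> float:
    """Mean cosine similarity over aligned patches of two adjacent grayscale
    frames; 1 - similarity serves as a crude motion score.  This is a
    stand-in for the paper's patch mutual information (PMI) score."""
    H, W = prev.shape
    sims = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            a = prev[y:y + patch, x:x + patch].ravel().astype(np.float64)
            b = cur[y:y + patch, x:x + patch].ravel().astype(np.float64)
            denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
            sims.append(float(a @ b) / denom)
    return 1.0 - float(np.mean(sims))

def select_frames(frames: np.ndarray, k: int, shift: float = 0.05,
                  slope: float = 0.1) -> np.ndarray:
    """Adaptive selection: shape the per-gap motion scores with a shifted
    leaky ReLU, build their cumulative distribution, and pick frames at
    uniform quantiles of that CDF, so high-motion segments get more frames."""
    raw = np.array([patch_similarity_score(frames[t - 1], frames[t])
                    for t in range(1, len(frames))])
    shaped = np.where(raw > shift, raw - shift, slope * (raw - shift))
    shaped = shaped - shaped.min() + 1e-6              # keep weights positive
    cdf = np.cumsum(shaped) / shaped.sum()
    targets = (np.arange(k) + 0.5) / k
    # Gap i sits between frames i and i+1; pick the later frame of each gap.
    return np.searchsorted(cdf, targets) + 1
```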
arXiv Detail & Related papers (2023-04-14T00:01:11Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
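The clustering side can be sketched with the standard density-peaks formulation over per-frame features (local density rho and separation delta); the CNN extractor, the temporal-segment refinement, and the LSTM classifier are left out, and the thresholds below are heuristic assumptions rather than the paper's values.

```python
import numpy as np

def density_peak_keyframes(feats: np.ndarray, d_c=None, gamma_thresh=None) -> np.ndarray:
    """feats: (T, D) per-frame features (e.g. from a CNN).
    Classic density-peaks clustering (Rodriguez & Laio, 2014): frames with
    both high local density rho and large separation delta are key frames."""
    dist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)   # (T, T)
    if d_c is None:                       # cutoff distance: a small quantile
        d_c = np.quantile(dist[dist > 0], 0.1) if (dist > 0).any() else 1.0
    rho = (dist < d_c).sum(axis=1) - 1    # local density (exclude self)
    delta = np.empty(len(feats))
    for i in range(len(feats)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i, higher].min() if len(higher) else dist[i].max()
    gamma = rho * delta                   # peak score
    if gamma_thresh is None:              # heuristic: "automatic" key-frame count
        gamma_thresh = gamma.mean() + gamma.std()
    keyframes = np.where(gamma > gamma_thresh)[0]
    return keyframes if len(keyframes) else np.array([int(gamma.argmax())])
```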
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition [25.888314212797436]
We propose a novel video frame sampler for few-shot action recognition.
Task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA).
Experiments show a significant boost on various benchmarks including long-term videos.
arXiv Detail & Related papers (2022-07-20T09:04:12Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving temporal consistency without introducing overhead at inference time.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
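A bare-bones version of a temporal consistency term for a per-frame segmentation model is sketched below; it simply matches class distributions on consecutive frames, whereas the paper's loss accounts for motion and is paired with knowledge distillation, both of which are omitted here.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(logits_t: torch.Tensor,
                              logits_tm1: torch.Tensor) -> torch.Tensor:
    """logits_t, logits_tm1: (B, C, H, W) segmentation logits for frame t and
    frame t-1 from the same per-frame model.  A naive consistency term:
    encourage similar class distributions on consecutive frames.  Motion
    compensation between the two frames is deliberately skipped."""
    p_t = F.log_softmax(logits_t, dim=1)
    p_tm1 = F.softmax(logits_tm1.detach(), dim=1)   # stop gradient on the reference
    return F.kl_div(p_t, p_tm1, reduction="batchmean")
```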
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- FrameExit: Conditional Early Exiting for Efficient Video Recognition [11.92976432364216]
We propose a conditional early exiting framework for efficient video recognition.
Our model learns to process fewer frames for simpler videos and more frames for complex ones.
Our method sets a new state of the art for efficient video understanding on the HVU benchmark.
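A schematic of conditional early exiting, assuming a per-frame backbone, a running feature average, and per-step classification heads with a confidence threshold; all module names and the gating rule are placeholders rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class EarlyExitRecognizer(nn.Module):
    """Process frames one by one; after each frame, check whether the
    accumulated evidence is already confident enough to stop.  Simple videos
    therefore use few frames, hard ones use more."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 max_frames: int = 10, threshold: float = 0.9):
        super().__init__()
        self.backbone = backbone                       # per-frame feature extractor
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(max_frames)])
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, video: torch.Tensor):
        # video: (T, 3, H, W); single-video inference for clarity.
        agg = None
        for t, frame in enumerate(video):
            feat = self.backbone(frame.unsqueeze(0)).flatten(1)   # (1, feat_dim)
            agg = feat if agg is None else (agg * t + feat) / (t + 1)
            probs = self.heads[min(t, len(self.heads) - 1)](agg).softmax(dim=-1)
            if probs.max() >= self.threshold:          # exit early when confident
                return probs, t + 1                    # prediction, frames used
        return probs, len(video)
```

The threshold (here a fixed confidence cutoff) is the knob that trades accuracy against the average number of frames processed per video.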
arXiv Detail & Related papers (2021-04-27T18:01:05Z)
- No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
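A rough sketch of the cluster-then-aggregate step, assuming per-frame activations are already extracted: the timeline is split at the largest changes between consecutive activations and each contiguous segment is averaged into one representation. The paper's clustering and aggregation are more involved than this stand-in.

```python
import numpy as np

def temporal_cluster_aggregate(acts: np.ndarray, num_clusters: int) -> np.ndarray:
    """acts: (T, D) per-frame activations.  Split the timeline at the
    (num_clusters - 1) largest jumps between consecutive activations, then
    average each contiguous segment into one representation."""
    T = len(acts)
    diffs = np.linalg.norm(acts[1:] - acts[:-1], axis=1)       # (T-1,)
    if num_clusters > 1:
        cuts = np.sort(np.argsort(diffs)[-(num_clusters - 1):] + 1)
    else:
        cuts = np.array([], dtype=int)
    bounds = np.concatenate(([0], cuts, [T]))
    return np.stack([acts[bounds[i]:bounds[i + 1]].mean(axis=0)
                     for i in range(len(bounds) - 1)])
```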
arXiv Detail & Related papers (2021-03-29T07:44:28Z)
- Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
- Scene-Adaptive Video Frame Interpolation via Meta-Learning [54.87696619177496]
We propose to adapt the model to each video by making use of additional information that is readily available at test time.
We obtain significant performance gains with only a single gradient update without any additional parameters.
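A bare-bones sketch of adapting an interpolation model to a test video with a single gradient update, assuming a model of the form f(frame_a, frame_b) -> middle frame and that triplets of existing frames from the test video provide free supervision; the optimizer, loss, and triplet choice are illustrative assumptions.

```python
import copy
import torch

def adapt_one_step(model: torch.nn.Module, video: torch.Tensor,
                   lr: float = 1e-4) -> torch.nn.Module:
    """video: (T, 3, H, W) frames of the *test* video.  Use triplets
    (t-1, t+1) -> t already present in the video as self-supervision, take a
    single gradient step on a copy of the model, and return the adapted copy."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss = 0.0
    for t in range(1, video.shape[0] - 1, 2):       # a few frame triplets
        pred = adapted(video[t - 1].unsqueeze(0), video[t + 1].unsqueeze(0))
        loss = loss + torch.nn.functional.l1_loss(pred, video[t].unsqueeze(0))
    if not torch.is_tensor(loss):                   # too few frames to adapt
        return adapted
    opt.zero_grad()
    loss.backward()
    opt.step()                                      # exactly one update
    return adapted
```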
arXiv Detail & Related papers (2020-04-02T02:46:44Z)