Deep Unsupervised Key Frame Extraction for Efficient Video
Classification
- URL: http://arxiv.org/abs/2211.06742v1
- Date: Sat, 12 Nov 2022 20:45:35 GMT
- Title: Deep Unsupervised Key Frame Extraction for Efficient Video
Classification
- Authors: Hao Tang, Lei Ding, Songsong Wu, Bin Ren, Nicu Sebe, Paolo Rota
- Abstract summary: This work presents an unsupervised method to retrieve key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works: it can determine the number of key frames automatically, and it preserves the temporal information of the video.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
- Score: 63.25852915237032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video processing and analysis have become urgent tasks since a huge
number of videos (e.g., on YouTube and Hulu) are uploaded online every day. The
extraction of representative key frames from videos is very important in video
processing and analysis since it greatly reduces computing resources and time.
Although great progress has been made recently, large-scale video classification
remains an open problem, as existing methods have not balanced performance and
efficiency well. To tackle this problem, this work presents an unsupervised
method to retrieve key frames that combines a Convolutional Neural Network (CNN)
with Temporal Segment Density Peaks Clustering (TSDPC). The proposed TSDPC is a
generic and powerful framework with two advantages over previous works: it can
determine the number of key frames automatically, and it preserves the temporal
information of the video. It thus improves the efficiency of video
classification. Furthermore, a Long Short-Term Memory network (LSTM) is added
on top of the CNN to further improve classification performance.
Moreover, a weight fusion strategy of different input networks is presented to
boost the performance. By optimizing both video classification and key frame
extraction simultaneously, we achieve better classification performance and
higher efficiency. We evaluate our method on two popular datasets (i.e., HMDB51
and UCF101) and the experimental results consistently demonstrate that our
strategy achieves competitive performance and efficiency compared with the
state-of-the-art approaches.
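To make the key-frame selection step concrete, below is a minimal sketch of density peaks clustering applied per temporal segment, in the spirit of the TSDPC idea summarized above. The function names, the Gaussian density estimate, and the peak-selection threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def density_peaks_key_frames(features, cutoff_ratio=0.02):
    """Select key frames inside one temporal segment via density peaks clustering.

    features: (n_frames, d) array of CNN frame embeddings.
    Frames with both high local density and a large distance to any denser
    frame are treated as cluster peaks, so the number of key frames is
    determined automatically instead of being fixed in advance.
    """
    n = len(features)
    # pairwise distances between frame embeddings
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    d_c = np.quantile(dists, cutoff_ratio) + 1e-8          # cutoff distance
    rho = np.exp(-(dists / d_c) ** 2).sum(axis=1) - 1.0    # local density (minus self term)
    delta = np.empty(n)                                    # distance to nearest denser frame
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        delta[i] = dists[i].max() if denser.size == 0 else dists[i, denser].min()
    gamma = rho * delta                                    # peak score
    return np.where(gamma > gamma.mean() + gamma.std())[0]

def tsdpc(video_features, num_segments=8):
    """Run density peaks clustering per temporal segment to preserve temporal order."""
    key_frames, offset = [], 0
    for segment in np.array_split(video_features, num_segments):
        key_frames.extend(offset + idx for idx in density_peaks_key_frames(segment))
        offset += len(segment)
    return key_frames
```

The abstract then stacks an LSTM on top of the CNN features of the retained frames for classification; that stage, and the weight fusion of different input networks, are not shown in this sketch.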
Related papers
- KeyVideoLLM: Towards Large-scale Video Keyframe Selection [38.39013577942218]
KeyVideoLLM is a text-video frame similarity-based selection method designed to manage VideoLLM data efficiently.
It achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements.
It enhances processing speed by up to 200 times compared to existing selection methods.
arXiv Detail & Related papers (2024-07-03T13:41:44Z)
- ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have high computational cost, and cannot be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computation, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics that characterize a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
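As a rough illustration of how such criteria might translate into frame-level importance scores, the sketch below scores each frame by local dissimilarity, global consistency, and uniqueness using cosine similarity over pre-trained frame features; the paper's exact formulations and its contrastively learned projection module may differ, and all names here are hypothetical.

```python
import numpy as np

def frame_importance(features, window=5):
    """Score frames by local dissimilarity, global consistency, and uniqueness.

    features: (n_frames, d) array of pre-trained frame features.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T                        # cosine similarity between all frames
    global_mean = feats.mean(axis=0)
    global_mean /= np.linalg.norm(global_mean)
    n = len(feats)
    scores = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local_dissimilarity = 1.0 - sim[i, lo:hi].mean()    # stands out from its neighbours
        global_consistency = float(feats[i] @ global_mean)  # still representative of the video
        uniqueness = 1.0 - np.sort(sim[i])[-2]              # far from its nearest neighbour
        scores[i] = local_dissimilarity + global_consistency + uniqueness
    return scores
```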
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.4-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- A Simple Baseline for Video Restoration with Grouped Spatial-temporal Shift [36.71578909392314]
In this study, we propose a simple yet effective framework for video restoration.
Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique.
Our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost.
arXiv Detail & Related papers (2022-06-22T02:16:47Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence; its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
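The following PyTorch sketch illustrates the glance-then-focus pipeline this summary describes: a light global network scans each full frame, a recurrent policy proposes where to crop, and only the crop is processed by a heavier network. Module sizes, the cropping mechanics, and the omitted reinforcement-learning training of the policy are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GlanceAndFocus(nn.Module):
    def __init__(self, num_classes=51, patch=96):
        super().__init__()
        self.patch = patch
        self.glance = nn.Sequential(                      # cheap global feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.policy = nn.GRUCell(16, 16)                  # recurrent patch-location policy
        self.loc_head = nn.Linear(16, 2)                  # predicts (x, y) patch centre in [0, 1]
        self.focus = nn.Sequential(                       # heavier net, sees only the patch
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, video):                             # video: (T, 3, H, W), H and W >= patch
        T, _, H, W = video.shape
        h = video.new_zeros(1, 16)
        logits = []
        for t in range(T):
            g = self.glance(video[t:t + 1])               # glance at the full frame
            h = self.policy(g, h)
            xy = torch.sigmoid(self.loc_head(h))[0]
            cx = int(xy[0] * (W - self.patch))            # non-differentiable crop location
            cy = int(xy[1] * (H - self.patch))
            crop = video[t:t + 1, :, cy:cy + self.patch, cx:cx + self.patch]
            logits.append(self.focus(crop))               # classify from the crop only
        return torch.stack(logits).mean(0)                # average per-frame predictions
```

Because the integer crop indices are non-differentiable, the location policy would be trained with reinforcement learning, as the summary indicates; that training loop and the parallel offline-inference path are left out of this sketch.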
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)