Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification
- URL: http://arxiv.org/abs/2403.08580v1
- Date: Wed, 13 Mar 2024 14:35:13 GMT
- Title: Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification
- Authors: Yuxing Han, Yunan Ding, Chen Ye Gan, Jiangtao Wen
- Abstract summary: Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval.
Traditional methods require video decompression to extract pixel-level features like color, texture, and motion.
We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding.
- Score: 12.322783570127756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifying videos into distinct categories, such as Sport and Music Video,
is crucial for multimedia understanding and retrieval, especially when an
immense volume of video content is being constantly generated. Traditional
methods require video decompression to extract pixel-level features like color,
texture, and motion, thereby increasing computational and storage demands.
Moreover, these methods often suffer from performance degradation in
low-quality videos. We present a novel approach that examines only the
post-compression bitstream of a video to perform classification, eliminating
the need for bitstream decoding. To validate our approach, we built a
comprehensive data set comprising over 29,000 YouTube video clips, totaling
6,000 hours and spanning 11 distinct categories. Our evaluations indicate
precision, accuracy, and recall rates consistently above 80%, many exceeding
90%, and some reaching 99%. The algorithm operates approximately 15,000 times
faster than real-time for 30fps videos, outperforming the traditional Dynamic
Time Warping (DTW) algorithm by seven orders of magnitude.
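The abstract does not include implementation details. As a hedged sketch of the core idea only, assuming per-frame compressed sizes have already been read from the bitstream, a lightweight classifier might operate on simple statistics of the size sequence; the features and the nearest-centroid model below are hypothetical stand-ins, not the authors' method:

```python
import numpy as np

def frame_size_features(sizes):
    """Summarize a sequence of per-frame compressed sizes (bytes).

    Hypothetical features: mean size, variability, and I-frame "spikiness"
    (largest frame relative to the mean). The paper does not specify its
    exact feature set.
    """
    s = np.asarray(sizes, dtype=float)
    ratio = s.max() / max(s.mean(), 1e-9)
    return np.array([s.mean(), s.std(), ratio])

def nearest_centroid(features, centroids):
    """Assign the clip to the closest class centroid (toy stand-in model)."""
    dists = {label: np.linalg.norm(features - c) for label, c in centroids.items()}
    return min(dists, key=dists.get)

# Toy centroids: sports footage tends toward larger, more variable frames.
centroids = {
    "Sport": np.array([9000.0, 4000.0, 3.0]),
    "Music Video": np.array([4000.0, 1500.0, 2.0]),
}
clip_sizes = [12000, 5000, 4800, 5200, 11000, 5100]  # bytes per frame
label = nearest_centroid(frame_size_features(clip_sizes), centroids)
```

Because no pixel decoding happens, the per-clip cost is a few arithmetic operations over an integer sequence, which is what makes the reported orders-of-magnitude speedup over real-time plausible.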
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache)
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
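The summary above describes a training-free reuse mechanism only at a high level. As a minimal sketch of that idea (not the authors' code; the threshold, change metric, and motion weighting are all hypothetical), a module's output can be reused across diffusion steps when its input has barely changed, with motion content tightening the reuse threshold:

```python
import numpy as np

class AdaptiveCache:
    """Sketch of training-free output reuse for an expensive module.

    If the module input changed little since the last computed step, serve
    the cached output instead of recomputing. Higher motion shrinks the
    reuse threshold (a MoReg-like idea).
    """
    def __init__(self, module, threshold=0.05):
        self.module = module
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None
        self.recomputes = 0

    def __call__(self, x, motion=0.0):
        eff = self.threshold / (1.0 + motion)  # more motion -> less reuse
        if self.prev_input is not None:
            change = np.linalg.norm(x - self.prev_input) / (
                np.linalg.norm(self.prev_input) + 1e-9)
            if change < eff:
                return self.prev_output  # cache hit: skip computation
        self.prev_input = x.copy()
        self.prev_output = self.module(x)
        self.recomputes += 1
        return self.prev_output

expensive = lambda x: x * 2.0  # stand-in for an attention block
cache = AdaptiveCache(expensive)
x = np.ones(4)
y1 = cache(x)           # computed once
y2 = cache(x + 1e-4)    # near-identical input: served from cache
```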
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- Blurry Video Compression: A Trade-off between Visual Enhancement and Data Compression [65.8148169700705]
Existing video compression (VC) methods primarily aim to reduce the spatial and temporal redundancies between consecutive frames in a video.
Previous works have achieved remarkable results on videos acquired under specific settings such as instant (known) exposure time and shutter speed.
In this work, we tackle the VC problem in a general scenario where a given video can be blurry due to predefined camera settings or dynamics in the scene.
arXiv Detail & Related papers (2023-11-08T02:17:54Z)
- Judging a video by its bitstream cover [12.322783570127756]
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval.
Traditional methods require video decompression to extract pixel-level features like color, texture, and motion.
We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding.
arXiv Detail & Related papers (2023-09-14T00:34:11Z)
- LSCD: A Large-Scale Screen Content Dataset for Video Compression [5.857003653854907]
We propose the Large-scale Screen Content dataset, which contains 714 source sequences.
We provide the analysis of the proposed dataset to show some features of screen content videos.
We also provide a benchmark containing the performance of both traditional and learning-based methods.
arXiv Detail & Related papers (2023-08-18T06:27:35Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are two of the most common problems facing video processing algorithms.
This paper presents an approach that uses the residual data available directly in compressed videos, which can be obtained by a lightweight partial decoding procedure.
Applying neural networks exclusively for accumulated residuals in the compressed domain accelerates performance, while the classification results are highly competitive with raw video approaches.
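The summary leaves the accumulation step abstract. As a hedged sketch, assuming residual frames have already been obtained from partial decoding (the function name and the optional decay weighting are hypothetical), accumulating a group of pictures' residuals into a single motion map might look like:

```python
import numpy as np

def accumulate_residuals(residuals, decay=1.0):
    """Sum residual frames from a compressed GOP into one motion map.

    `decay` optionally down-weights older residuals; the paper's exact
    accumulation rule may differ.
    """
    acc = np.zeros_like(residuals[0], dtype=float)
    for r in residuals:
        acc = decay * acc + r
    return acc

# Toy residuals: motion concentrated in the top-left pixel.
residuals = [np.zeros((4, 4)) for _ in range(3)]
for r in residuals:
    r[0, 0] = 1.0
motion_map = accumulate_residuals(residuals)
# The accumulated map would feed the network in place of decoded RGB frames.
```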
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
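One way to read the mechanism above: features from several sampled clips are pooled into a shared memory, and each clip's prediction then conditions on that memory, giving it context beyond its own frames. A minimal sketch of that information flow (the real memory is learned end-to-end; the mean-pooling and linear head here are illustrative only):

```python
import numpy as np

def collaborative_memory_logits(clip_feats, classifier_w):
    """Sketch: pool per-clip features into a shared memory vector,
    concatenate it back onto each clip, and classify each clip.

    Illustrates only the cross-clip information flow; the paper's
    memory mechanism is learned, not a simple mean.
    """
    memory = np.mean(clip_feats, axis=0)              # shared across clips
    augmented = [np.concatenate([f, memory]) for f in clip_feats]
    return [classifier_w @ a for a in augmented]

clips = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two sampled clips
w = np.ones((3, 4))                                   # toy 3-class head
logits = collaborative_memory_logits(clips, w)
```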
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.