Judging a video by its bitstream cover
- URL: http://arxiv.org/abs/2309.07361v1
- Date: Thu, 14 Sep 2023 00:34:11 GMT
- Title: Judging a video by its bitstream cover
- Authors: Yuxing Han, Yunan Ding, Jiangtao Wen, Chen Ye Gan
- Abstract summary: Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval.
Traditional methods require video decompression to extract pixel-level features like color, texture, and motion.
We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding.
- Score: 12.322783570127756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifying videos into distinct categories, such as Sport and Music Video,
is crucial for multimedia understanding and retrieval, especially in an age
where an immense volume of video content is constantly being generated.
Traditional methods require video decompression to extract pixel-level features
like color, texture, and motion, thereby increasing computational and storage
demands. Moreover, these methods often suffer from performance degradation in
low-quality videos. We present a novel approach that examines only the
post-compression bitstream of a video to perform classification, eliminating
the need for bitstream decoding. We validate our approach using a custom-built data set
comprising over 29,000 YouTube video clips, totaling 6,000 hours and spanning
11 distinct categories. Our preliminary evaluations indicate precision,
accuracy, and recall rates well over 80%. The algorithm operates approximately
15,000 times faster than real-time for 30fps videos, outperforming the traditional
Dynamic Time Warping (DTW) algorithm by six orders of magnitude.
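The bitstream-only idea can be illustrated with a minimal sketch. Per the companion paper's title ("Leveraging Compressed Frame Sizes"), the observable is the sequence of per-frame compressed sizes, available without any pixel decoding. The feature set, the toy centroids, and the nearest-centroid classifier below are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch: classify a video from its compressed frame-size
# sequence alone (no pixel decoding). Features and classifier are
# illustrative assumptions, not the paper's method.
import math

def frame_size_features(sizes):
    """Summary statistics of a per-frame compressed-size sequence (bytes)."""
    n = len(sizes)
    mean = sum(sizes) / n
    var = sum((s - mean) ** 2 for s in sizes) / n
    # Ratio of the largest (likely intra/I) frame to the mean hints at
    # GOP structure and motion complexity.
    peak_ratio = max(sizes) / mean
    return [mean, math.sqrt(var), peak_ratio]

def nearest_centroid(features, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda label: dist(features, centroids[label]))

# Toy centroids for two hypothetical categories (values are made up).
centroids = {
    "sport": [9000.0, 4000.0, 5.0],    # high motion: large, bursty frames
    "lecture": [2000.0, 500.0, 12.0],  # static scenes: small P-frames, tall I-frame peaks
}

# Synthetic frame-size trace: a big I-frame every 30 frames, small P-frames.
trace = [15000 if i % 30 == 0 else 1800 for i in range(300)]
print(nearest_centroid(frame_size_features(trace), centroids))  # → lecture
```

Because the classifier never touches pixel data, the per-video cost is dominated by reading packet sizes, which is consistent with the reported orders-of-magnitude speedup over decode-based pipelines.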
Related papers
- Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache)
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content.
arXiv Detail & Related papers (2024-11-04T18:59:44Z)
- Leveraging Compressed Frame Sizes For Ultra-Fast Video Classification [12.322783570127756]
Classifying videos into distinct categories, such as Sport and Music Video, is crucial for multimedia understanding and retrieval.
Traditional methods require video decompression to extract pixel-level features like color, texture, and motion.
We present a novel approach that examines only the post-compression bitstream of a video to perform classification, eliminating the need for bitstream decoding.
arXiv Detail & Related papers (2024-03-13T14:35:13Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Compressed Vision for Efficient Video Understanding [83.97689018324732]
We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
arXiv Detail & Related papers (2022-10-06T15:35:49Z)
- Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are two of the most common issues faced by video processing algorithms.
This paper presents an approach for using residual data, available in compressed videos directly, which can be obtained by a light partially decoding procedure.
Applying neural networks exclusively for accumulated residuals in the compressed domain accelerates performance, while the classification results are highly competitive with raw video approaches.
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval [12.17757623963458]
This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval.
We present the first hashing framework that infers the unseen future content of a currently playing video.
Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task.
arXiv Detail & Related papers (2020-09-30T13:25:59Z)
- Subjective and Objective Quality Assessment of High Frame Rate Videos [60.970191379802095]
High frame rate (HFR) videos are becoming increasingly common with the tremendous popularity of live, high-action streaming content such as sports.
The LIVE-YT-HFR dataset comprises 480 videos at 6 different frame rates, obtained from 16 diverse contents.
To obtain subjective labels on the videos, we conducted a human study yielding 19,000 human quality ratings obtained from a pool of 85 human subjects.
arXiv Detail & Related papers (2020-07-22T19:11:42Z)