Compressed Vision for Efficient Video Understanding
- URL: http://arxiv.org/abs/2210.02995v1
- Date: Thu, 6 Oct 2022 15:35:49 GMT
- Title: Compressed Vision for Efficient Video Understanding
- Authors: Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski
- Abstract summary: We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos.
We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
- Score: 83.97689018324732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Experience and reasoning occur across multiple temporal scales: milliseconds,
seconds, hours or days. The vast majority of computer vision research, however,
still focuses on individual images or short videos lasting only a few seconds.
This is because handling longer videos requires more scalable approaches even to
process them. In this work, we propose a framework enabling research on
hour-long videos with the same hardware that can now process second-long
videos. We replace standard video compression, e.g. JPEG, with neural
compression and show that we can directly feed compressed videos as inputs to
regular video networks. Operating on compressed videos improves efficiency at
all pipeline levels -- data transfer, speed and memory -- making it possible to
train models faster and on much longer videos. Processing compressed signals
has, however, the downside of precluding standard augmentation techniques if
done naively. We address that by introducing a small network that can apply
transformations to latent codes corresponding to commonly used augmentations in
the original video space. We demonstrate that with our compressed vision
pipeline, we can train video models more efficiently on popular benchmarks such
as Kinetics600 and COIN. We also perform proof-of-concept experiments with new
tasks defined over hour-long videos at standard frame rates. Processing such
long videos is impossible without using a compressed representation.
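
The abstract's idea of applying an augmentation to latent codes rather than to pixels can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `encode` is a 2x2 average-pooling stand-in for a codec (the paper uses a learned neural codec), and `hflip` is a hand-written horizontal flip. For this toy encoder the latent-space counterpart of the flip is known in closed form; the paper instead introduces a small network to apply such transformations to latent codes.

```python
# Toy illustration only: 2x2 average pooling stands in for a learned codec.

def encode(frame, block=2):
    """Compress one frame (a list of rows) by averaging block x block patches."""
    height, width = len(frame), len(frame[0])
    return [
        [
            sum(frame[y + dy][x + dx]
                for dy in range(block)
                for dx in range(block)) / block ** 2
            for x in range(0, width, block)
        ]
        for y in range(0, height, block)
    ]


def hflip(grid):
    """Horizontal flip; works on pixel frames and on latent grids alike."""
    return [row[::-1] for row in grid]


# A 4x4 single-channel frame with distinct pixel values 0..15.
frame = [[float(4 * y + x) for x in range(4)] for y in range(4)]

# Augmenting in pixel space and then encoding ...
augment_then_encode = encode(hflip(frame))
# ... matches encoding once and augmenting the stored latent code.
encode_then_augment = hflip(encode(frame))

assert augment_then_encode == encode_then_augment
```

In this sketch the frame only needs to be encoded once; the flip is then applied to the 2x2 latent grid instead of the 4x4 frame, which is the property that lets augmentation coexist with training directly on compressed inputs.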
Related papers
- Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation.
Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.
We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models [64.9366388601049]
Visual token compression is leveraged to reduce the considerable token length of visual inputs.
We introduce a unified token compression strategy called Progressive Visual Token Compression.
Our model achieves state-of-the-art performance across various video understanding benchmarks.
arXiv Detail & Related papers (2024-12-12T18:59:40Z)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- Accurate and Fast Compressed Video Captioning [28.19362369787383]
Existing video captioning approaches typically first sample video frames from a decoded video and then run a subsequent process on them.
We study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline.
We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning.
arXiv Detail & Related papers (2023-09-22T13:43:22Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are the two most common problematic issues related to video processing algorithms.
This paper presents an approach that uses the residual data available directly in compressed videos, which can be obtained by a lightweight partial decoding procedure.
Applying neural networks exclusively to accumulated residuals in the compressed domain accelerates performance, while the classification results are highly competitive with raw video approaches.
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
- Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement [74.1052624663082]
We develop a deep learning architecture capable of restoring detail to compressed videos.
We show that this improves restoration accuracy compared to prior compression correction methods.
We condition our model on quantization data which is readily available in the bitstream.
arXiv Detail & Related papers (2022-01-31T18:56:04Z)