Related papers: Compressed Vision for Efficient Video Understanding

Compressed Vision for Efficient Video Understanding

URL: http://arxiv.org/abs/2210.02995v1
Date: Thu, 6 Oct 2022 15:35:49 GMT
Title: Compressed Vision for Efficient Video Understanding
Authors: Olivia Wiles and Joao Carreira and Iain Barr and Andrew Zisserman and Mateusz Malinowski
Abstract summary: We propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.
Score: 83.97689018324732
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.

Related papers

Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models [64.9366388601049]
Visual token compression is leveraged to reduce the considerable token length of visual inputs. We introduce a unified token compression strategy called Progressive Visual Token Compression. Our model achieves state-of-the-art performance across various video understanding benchmarks.
arXiv Detail & Related papers (2024-12-12T18:59:40Z)
REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents. We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024*1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion. We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
Accurate and Fast Compressed Video Captioning [28.19362369787383]
Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process. We study video captioning from a different perspective in compressed domain, which brings multi-fold advantages over the existing pipeline. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning.
arXiv Detail & Related papers (2023-09-22T13:43:22Z)
MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
Speeding Up Action Recognition Using Dynamic Accumulation of Residuals in Compressed Domain [2.062593640149623]
Temporal redundancy and the sheer size of raw videos are the two most common problematic issues related to video processing algorithms. This paper presents an approach for using residual data, available in compressed videos directly, which can be obtained by a light partially decoding procedure. Applying neural networks exclusively for accumulated residuals in the compressed domain accelerates performance, while the classification results are highly competitive with raw video approaches.
arXiv Detail & Related papers (2022-09-29T13:08:49Z)
Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement [74.1052624663082]
We develop a deep learning architecture capable of restoring detail to compressed videos. We show that this improves restoration accuracy compared to prior compression correction methods. We condition our model on quantization data which is readily available in the bitstream.
arXiv Detail & Related papers (2022-01-31T18:56:04Z)
Diverse Generation from a Single Video Made Possible [24.39972895902724]
We present a fast and practical method for video generation and manipulation from a single natural video. Our method generates more realistic and higher quality results than single-video GANs.
arXiv Detail & Related papers (2021-09-17T15:12:17Z)
Conditional Entropy Coding for Efficient Video Compression [82.35389813794372]
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames. We first show that a simple architecture modeling the entropy between the image latent codes is as competitive as other neural video compression works and video codecs. We then propose a novel internal learning extension on top of this architecture that brings an additional 10% savings without trading off decoding speed.
arXiv Detail & Related papers (2020-08-20T20:01:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.