Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model
- URL: http://arxiv.org/abs/2412.04729v3
- Date: Fri, 16 May 2025 14:23:46 GMT
- Title: Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model
- Authors: Keunwoo Peter Yu, Achal Dave, Rares Ambrus, Jean Mercat,
- Abstract summary: We introduce $\texttt{Espresso}$, a new architecture that separately compresses spatial and temporal features into fixed-length sequences. Experiments show that fixed-length compression combined with segment-wise processing offers a scalable and competitive alternative to pooling-based approaches.
- Score: 15.320117192047265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models (VLMs) have shown great promise in connecting images and text, but extending these models to long videos remains challenging due to the rapid growth in token counts. Models that compress videos by local aggregation in time or space have become popular for handling long-form inputs; however, these pooling-based projectors sacrifice the benefits of fixed-length representations that are crucial for streaming and efficient video understanding. We introduce $\texttt{Espresso}$, a new architecture that separately compresses spatial and temporal features into fixed-length sequences. $\texttt{Espresso}$ enables efficient video encoding while maintaining strong long-form reasoning capabilities. Experiments show that fixed-length compression combined with segment-wise processing offers a scalable and competitive alternative to pooling-based approaches. Our results demonstrate that fixed-length projectors, when properly designed and trained, remain a viable foundation for video-language modeling.
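As a rough illustration of the abstract's core idea, the sketch below compresses a variable number of patch and frame tokens into a fixed-length sequence using learned query tokens and cross-attention pooling. The module names, sizes, and pooling scheme are assumptions for illustration, not $\texttt{Espresso}$'s actual implementation.

```python
# Minimal sketch of fixed-length spatio-temporal compression with learned
# query tokens (cross-attention pooling). Names, sizes, and the use of
# nn.MultiheadAttention are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class FixedLengthCompressor(nn.Module):
    """Compress a variable-length token sequence into `num_queries` tokens."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) -> (B, num_queries, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out

class SpatioTemporalCompressor(nn.Module):
    """Separately pool over space (per frame) and time (per segment)."""
    def __init__(self, dim: int, spatial_q: int = 16, temporal_q: int = 16):
        super().__init__()
        self.spatial = FixedLengthCompressor(dim, spatial_q)
        self.temporal = FixedLengthCompressor(dim, temporal_q)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, P, D) patch tokens for the T frames of one segment.
        B, T, P, D = frames.shape
        # Spatial path: compress each frame's P patches, then average over frames.
        spatial = self.spatial(frames.reshape(B * T, P, D)).reshape(B, T, -1, D).mean(1)
        # Temporal path: compress the sequence of per-frame mean features over T.
        temporal = self.temporal(frames.mean(2))
        # Fixed-length output regardless of T or P: (B, spatial_q + temporal_q, D)
        return torch.cat([spatial, temporal], dim=1)
```

Because the output length is independent of the number of frames per segment, such a projector keeps the token budget per segment constant, which is what makes streaming and segment-wise processing tractable.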
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - Slow-Fast Architecture for Video Multi-Modal Large Language Models [42.3957835391319]
Existing methods compress video representations using predefined rules before feeding them into the multi-modal large language model. We propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation.
arXiv Detail & Related papers (2025-04-02T03:24:58Z) - HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks.
It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models [27.726733116479668]
We propose an End-to-end Video Subtitle Extraction method, called EVE, which consists of three modules: a vision encoder, an adapter module, and a large language model.
To effectively compress the visual tokens from the vision encoder, we propose a novel adapter InterleavedVT to interleave two modalities.
To benchmark the video subtitle extraction task, we propose a large dataset ViSa including 2.5M videos.
arXiv Detail & Related papers (2025-03-06T03:19:56Z) - Fine-Grained Captioning of Long Videos through Scene Graph Consolidation [44.30028794237688]
We introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals. A lightweight graph-to-text decoder then produces the final video-level caption.
arXiv Detail & Related papers (2025-02-23T03:59:05Z) - Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces [20.860632218272094]
Video tokenizers are essential for latent video diffusion models, converting raw video data into latent spaces for efficient training.
We propose an alternative approach to enhance temporal compression.
We develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models.
arXiv Detail & Related papers (2025-01-09T18:55:15Z) - VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs).
This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark.
We build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z) - Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation.
Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.
We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z) - PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models [64.9366388601049]
Visual token compression is leveraged to reduce the considerable token length of visual inputs.
We introduce a unified token compression strategy called Progressive Visual Token Compression.
Our model achieves state-of-the-art performance across various video understanding benchmarks.
arXiv Detail & Related papers (2024-12-12T18:59:40Z) - LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
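The redundant-frame removal step described above can be illustrated with a simple similarity filter over per-frame features. The threshold and the use of a single global feature per frame are assumptions for illustration, not LongVU's exact procedure.

```python
# Hedged sketch of similarity-based frame pruning: drop a frame when its
# feature is nearly identical to the most recently kept frame.
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor, threshold: float = 0.9):
    """frame_feats: (T, D) one global feature per frame (e.g., from DINOv2).
    Returns the indices of frames to keep."""
    feats = F.normalize(frame_feats, dim=-1)
    keep = [0]
    for t in range(1, feats.size(0)):
        # Cosine similarity with the last kept frame; keep only dissimilar frames.
        if torch.dot(feats[t], feats[keep[-1]]) < threshold:
            keep.append(t)
    return keep
```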
arXiv Detail & Related papers (2024-10-22T21:21:37Z) - VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models [25.668485023831874]
VidCompress is a novel Video-LLM featuring memory-enhanced temporal compression.
It efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
arXiv Detail & Related papers (2024-10-15T09:07:25Z) - Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [25.61734041983714]
Video-XL is a novel approach that leverages MLLMs' inherent key-value sparsification capacity to condense the visual input.
Video-XL's effectiveness is verified from three aspects. First, it achieves a superior long-video understanding capability, outperforming state-of-the-art models of comparable sizes.
arXiv Detail & Related papers (2024-09-22T15:13:31Z) - VoCo-LLaMA: Towards Vision Compression with Large Language Models [56.20788367278211]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window.
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs.
Our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time.
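For scale: if the vision encoder emits 576 patch tokens per image (e.g., a 24$\times$24 grid from a CLIP-style ViT at 336px resolution, which is an assumption here), a 576$\times$ compression ratio corresponds to condensing each image into a single token before it enters the LLM's context.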
arXiv Detail & Related papers (2024-06-18T05:05:12Z) - Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of 145 words, which is over 10x longer than most video-text datasets.
Based on Vript, the authors also train a powerful model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
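The "common approach" described above is easy to make concrete: the video is split into uniformly spaced, fixed-length windows regardless of its content. The clip length and stride below are illustrative values only.

```python
# Sketch of uniform fixed-length clip sampling for long videos.
from typing import List

def uniform_clips(num_frames: int, clip_len: int = 32, stride: int = 32) -> List[range]:
    """Return fixed-length frame-index windows covering the video."""
    clips = []
    for start in range(0, max(num_frames - clip_len + 1, 1), stride):
        clips.append(range(start, min(start + clip_len, num_frames)))
    return clips

# A 10-minute video at 30 fps has 18000 frames -> 562 fixed-length clips,
# independent of how the content is actually segmented.
print(len(uniform_clips(18000)))
```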
arXiv Detail & Related papers (2023-09-20T18:13:32Z) - Exploring Long- and Short-Range Temporal Information for Learned Video Compression [54.91301930491466]
We focus on exploiting the unique characteristics of video content and exploring temporal information to enhance compression performance.
For long-range temporal information exploitation, we propose a temporal prior that can be updated continuously within the group of pictures (GOP) during inference.
In this case, the temporal prior contains valuable temporal information from all decoded images within the current GOP.
In detail, we design a hierarchical structure to achieve multi-scale compensation.
arXiv Detail & Related papers (2022-08-07T15:57:18Z) - Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.