VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
- URL: http://arxiv.org/abs/2410.11417v1
- Date: Tue, 15 Oct 2024 09:07:25 GMT
- Title: VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
- Authors: Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
- Abstract summary: VidCompress is a novel Video-LLM featuring memory-enhanced temporal compression.
It efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
- Score: 25.668485023831874
- Abstract: Video-based multimodal large language models (Video-LLMs) hold significant potential for video understanding tasks. However, most Video-LLMs treat a video as a sequence of individual frames, which leads to insufficient temporal-spatial interaction that hinders fine-grained comprehension, and makes longer videos difficult to process given limited visual token capacity. To address these challenges, we propose VidCompress, a novel Video-LLM featuring memory-enhanced temporal compression. VidCompress employs a dual-compressor approach: a memory-enhanced compressor captures both short-term and long-term temporal relationships in videos and compresses the visual tokens using a multiscale transformer with a memory-cache mechanism, while a text-perceived compressor generates condensed visual tokens by utilizing Q-Former and integrating temporal contexts into query embeddings with cross attention. Experiments on several VideoQA datasets and comprehensive benchmarks demonstrate that VidCompress efficiently models complex temporal-spatial relations and significantly outperforms existing Video-LLMs.
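The abstract names the two compressors but not their wiring. Below is a minimal PyTorch sketch of how such a dual-compressor front end could look; the module names, dimensions, memory-update rule, and pooling-based "multiscale" stand-in are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class MemoryEnhancedCompressor(nn.Module):
    """Carries a memory cache across clips to capture short- and long-term
    temporal relations, and compresses each clip's tokens into fixed slots."""
    def __init__(self, dim=768, mem_slots=32, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.init_memory = nn.Parameter(torch.zeros(1, mem_slots, dim))
        self.pool = nn.AdaptiveAvgPool1d(mem_slots)  # crude multiscale stand-in

    def forward(self, clip_tokens, memory=None):
        # clip_tokens: (B, N, D) visual tokens of the current clip
        B = clip_tokens.size(0)
        mem = self.init_memory.expand(B, -1, -1) if memory is None else memory
        fused = self.encoder(torch.cat([mem, clip_tokens], dim=1))
        # Compress the fused sequence back down to the fixed memory size.
        return self.pool(fused.transpose(1, 2)).transpose(1, 2)  # (B, slots, D)

class TextPerceivedCompressor(nn.Module):
    """Q-Former-style compressor: learnable queries, shifted by a temporal
    context vector, cross-attend to the visual tokens."""
    def __init__(self, dim=768, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_tokens, temporal_context):
        # temporal_context: (B, 1, D), e.g. the mean of the memory tokens
        q = self.queries.expand(visual_tokens.size(0), -1, -1) + temporal_context
        out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return out  # (B, n_queries, D) condensed tokens handed to the LLM

frames = torch.randn(1, 196, 768)            # one clip of patch tokens
mem = MemoryEnhancedCompressor()(frames)     # (1, 32, 768) memory cache
ctx = mem.mean(dim=1, keepdim=True)          # (1, 1, 768) temporal context
out = TextPerceivedCompressor()(frames, ctx) # (1, 32, 768) condensed tokens
```
In this reading, the memory returned by the first compressor would be fed back in for the next clip, and its pooled summary conditions the Q-Former-style queries before both token sets are projected into the LLM's embedding space.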
Related papers
- $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation [19.616624959353697]
$\infty$-Video can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism.
Our framework augments video Q-formers by allowing them to process video contexts efficiently and without requiring additional training.
arXiv Detail & Related papers (2025-01-31T12:45:46Z)
- Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video variational autoencoders (VAEs) are essential for reducing video redundancy and facilitating efficient video generation.
Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance.
We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
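Temporal compression in video VAEs is commonly obtained with temporally strided 3D convolutions. A generic encoder stub under that assumption is sketched below (illustrative only, not this paper's architecture), mapping a (B, 3, T, H, W) clip to a latent that is 4x shorter in time and 8x smaller spatially.
```python
import torch
import torch.nn as nn

# Generic video-VAE encoder stub: the temporally strided Conv3d layers
# provide the temporal compression discussed above. All channel counts
# and strides are illustrative assumptions.
class VideoEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(256, 2 * latent_dim, kernel_size=3, padding=1),  # mean & logvar
        )

    def forward(self, video):
        mean, logvar = self.net(video).chunk(2, dim=1)
        return mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterize

x = torch.randn(1, 3, 16, 256, 256)
print(VideoEncoder()(x).shape)  # torch.Size([1, 16, 4, 32, 32])
```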
- IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs [0.0]
We propose a framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Vision Compressor (IQViC).
IQViC, a transformer-based visual compressor, enables question-conditioned in-context compression, unlike existing methods that rely on full video visual features.
We demonstrate the effectiveness of our proposed IQViC framework and its superiority over state-of-the-art methods in terms of video understanding accuracy and memory efficiency.
arXiv Detail & Related papers (2024-12-13T06:52:02Z)
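Question-conditioned compression of the kind IQViC describes can be illustrated by scoring visual tokens against a pooled question embedding and keeping only the most relevant ones; the scoring rule, shapes, and `keep` budget below are assumptions, not the paper's design.
```python
import torch
import torch.nn as nn

# Illustrative question-adaptive compressor: visual tokens are scored
# against the question embedding and only the top-k are retained, so the
# compressed context depends on what is being asked.
class QuestionAdaptiveCompressor(nn.Module):
    def __init__(self, dim=768, keep=64):
        super().__init__()
        self.keep = keep
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens, question_emb):
        # visual_tokens: (B, N, D); question_emb: (B, D) pooled question feature
        q = self.q_proj(question_emb).unsqueeze(1)         # (B, 1, D)
        scores = (self.v_proj(visual_tokens) * q).sum(-1)  # (B, N) relevance
        idx = scores.topk(self.keep, dim=1).indices        # keep top-k tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
        return visual_tokens.gather(1, idx)                # (B, keep, D)
```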
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models [64.9366388601049]
Visual token compression is leveraged to reduce the considerable token length of visual inputs.
We introduce a unified token compression strategy called Progressive Visual Token Compression.
Our model achieves state-of-the-art performance across various video understanding benchmarks.
arXiv Detail & Related papers (2024-12-12T18:59:40Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
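The DINOv2-based frame pruning LongVU describes can be sketched as a cosine-similarity pass over per-frame features; the 0.9 threshold and pooled-feature input are assumptions.
```python
import torch
import torch.nn.functional as F

# Sketch of similarity-based frame pruning: consecutive frames whose
# (e.g. DINOv2) features are nearly identical to the last kept frame are
# treated as redundant and dropped. The threshold is an assumption.
def prune_redundant_frames(frame_feats, threshold=0.9):
    # frame_feats: (T, D), one pooled feature vector per frame
    keep = [0]  # always keep the first frame
    for t in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[t], frame_feats[keep[-1]], dim=0)
        if sim < threshold:  # sufficiently different from the last kept frame
            keep.append(t)
    return torch.tensor(keep)

feats = torch.randn(32, 1024)         # e.g. DINOv2 CLS features for 32 frames
print(prune_redundant_frames(feats))  # indices of retained frames
```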
- VoCo-LLaMA: Towards Vision Compression with Large Language Models [56.20788367278211]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window.
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs.
Our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and a 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length with a constant number of video streaming tokens, which are encoded via memory propagation and then adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
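A toy version of the constant-token streaming idea: each clip is encoded jointly with the memory propagated from earlier clips, and only a fixed prefix of tokens is carried forward; module names and sizes are hypothetical.
```python
import torch
import torch.nn as nn

# Toy streaming encoder: every new clip is fused with the propagated
# memory, and a fixed number of tokens is kept, so per-step cost stays
# constant regardless of video length. Names and sizes are assumptions.
class StreamingEncoder(nn.Module):
    def __init__(self, dim=512, mem_tokens=64):
        super().__init__()
        self.mem_tokens = mem_tokens
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, clips):
        # clips: list of (B, N, D) token tensors, one per video clip
        memory = None
        for clip in clips:
            seq = clip if memory is None else torch.cat([memory, clip], dim=1)
            memory = self.encoder(seq)[:, :self.mem_tokens]  # keep fixed prefix
        return memory  # (B, mem_tokens, D) regardless of video length
```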
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
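Cross-frame token merging of the kind VidToMe proposes can be sketched in the general ToMe style: match each self-attention token to its most similar counterpart in a reference frame and average confident pairs. The matching rule and 0.8 threshold here are assumptions, not the paper's exact procedure.
```python
import torch
import torch.nn.functional as F

# Sketch of cross-frame token merging: each token in the current frame is
# matched to its most similar token in a reference frame, and highly
# similar pairs are averaged so self-attention sees temporally consistent
# tokens. A full ToMe-style scheme would also reduce the token count.
def merge_tokens(cur, ref, threshold=0.8):
    # cur, ref: (N, D) self-attention tokens from two frames
    sim = F.normalize(cur, dim=-1) @ F.normalize(ref, dim=-1).T  # (N, N)
    best_sim, best_idx = sim.max(dim=1)  # best match in ref for each token
    merged = cur.clone()
    hit = best_sim > threshold           # merge only confident matches
    merged[hit] = 0.5 * (cur[hit] + ref[best_idx[hit]])
    return merged  # (N, D)

tokens_t = torch.randn(196, 320)
tokens_prev = torch.randn(196, 320)
print(merge_tokens(tokens_t, tokens_prev).shape)  # torch.Size([196, 320])
```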