LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
- URL: http://arxiv.org/abs/2410.17434v1
- Date: Tue, 22 Oct 2024 21:21:37 GMT
- Title: LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
- Authors: Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, Vikas Chandra
- Abstract summary: LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
- Score: 65.46303012350207
- License:
- Abstract: Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, LongVU also scales effectively to a smaller model size with state-of-the-art video understanding performance.
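The temporal reduction step described in the abstract can be sketched in a few lines. This is an illustrative reconstruction, not the official LongVU code: the feature vectors stand in for DINOv2 frame embeddings, and the 0.95 cosine-similarity threshold is an assumed value.

```python
# Sketch of adaptive temporal compression: drop frames whose features are
# highly similar to the most recently kept frame. Random vectors stand in
# for DINOv2 embeddings; the threshold is an illustrative assumption.
import numpy as np

def remove_redundant_frames(features: np.ndarray, threshold: float = 0.95) -> list:
    """Return indices of frames to keep, comparing each frame against
    the most recently kept frame by cosine similarity."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    keep = [0]  # always keep the first frame
    for i in range(1, len(normed)):
        sim = float(normed[i] @ normed[keep[-1]])
        if sim < threshold:  # sufficiently different -> keep this frame
            keep.append(i)
    return keep

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
frames[3] = frames[2]  # an exact duplicate frame should be dropped
kept = remove_redundant_frames(frames)
print(kept)
```

The remaining frames would then go through the text-guided cross-modal query and spatial token reduction stages the abstract describes.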
Related papers
- ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding [55.320254859515714]
We introduce a training-free method, ReTaKe, to reduce both temporal visual redundancy and knowledge redundancy for long video understanding.
DPSelect identifies keyframes with local maximum peak distance based on their visual features, which is closely aligned with human video perception.
PivotKV employs the obtained keyframes as pivots and conducts KV-Cache compression for the non-pivot tokens with low attention scores.
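Pivot-based KV-cache pruning of this kind can be sketched generically. This is not the ReTaKe implementation; the shapes, the keep ratio, and the score-based ranking are illustrative assumptions about how pivot tokens are retained while low-attention tokens are dropped.

```python
# Generic sketch of pivot-based KV-cache pruning: pivot positions are kept
# unconditionally; among the remaining tokens, only the top-scoring fraction
# (by attention score) survives. Not the official ReTaKe code.
import numpy as np

def prune_kv_cache(keys, values, attn_scores, pivot_idx, keep_ratio=0.5):
    n = len(keys)
    pivots = set(pivot_idx)
    non_pivot = [i for i in range(n) if i not in pivots]
    k = int(len(non_pivot) * keep_ratio)
    # rank non-pivot tokens by attention score and keep the top-k
    top = sorted(non_pivot, key=lambda i: attn_scores[i], reverse=True)[:k]
    kept = sorted(pivots | set(top))
    return keys[kept], values[kept], kept

keys = np.arange(10, dtype=float).reshape(10, 1)
values = keys.copy()
scores = np.linspace(1.0, 0.1, 10)  # earlier tokens score higher here
k_new, v_new, kept = prune_kv_cache(keys, values, scores, pivot_idx=[0, 5])
print(kept)
```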
arXiv Detail & Related papers (2024-12-29T15:42:24Z) - IQViC: In-context, Question Adaptive Vision Compressor for Long-term Video Understanding LMMs [0.0]
We propose a framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC).
IQViC, a transformer-based visual compressor, enables question-conditioned in-context compression, unlike existing methods that rely on full video visual features.
We demonstrate the effectiveness of our proposed IQViC framework and its superiority over state-of-the-art methods in terms of video understanding accuracy and memory efficiency.
arXiv Detail & Related papers (2024-12-13T06:52:02Z) - AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$^2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams.
It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z) - Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows.
We propose to adapt LMMs for long video understanding tasks by extending the visual context window.
Our method consistently improves the performance as the number of video frames increases.
arXiv Detail & Related papers (2024-09-30T07:25:16Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video streaming tokens that are encoded and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z) - LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from limited adjacent frames.
We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.