Related papers: VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

URL: http://arxiv.org/abs/2408.16730v1
Date: Thu, 29 Aug 2024 17:21:58 GMT
Title: VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Authors: Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou,
Abstract summary: We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
Score: 66.00245701441547
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80\%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately \textasciitilde42\% time and \textasciitilde30\% memory savings for the entire training. Moreover, our method reduces the computation in the context and avoid decreasing the vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.

Related papers

DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs) Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping [13.846838416902575]
A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%.
arXiv Detail & Related papers (2025-03-26T04:16:48Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens. As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z)
Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows. We propose to adapt LMMs for long video understanding tasks by extending the visual context window. Our method consistently improves the performance as the number of video frames increases.
arXiv Detail & Related papers (2024-09-30T07:25:16Z)
Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving [9.900979396513687]
Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information. We propose Video Token Sparsification (VTS) to significantly reduce the total number of visual tokens while preserving the most salient information.
arXiv Detail & Related papers (2024-09-16T05:31:01Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. In this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
VoCo-LLaMA: Towards Vision Compression with Large Language Models [56.20788367278211]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss with a compression ratio of 576$times$, resulting in up to 94.8$%$ fewer FLOPs and 69.6$%$ acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok) SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.