B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- URL: http://arxiv.org/abs/2412.09919v1
- Date: Fri, 13 Dec 2024 07:13:40 GMT
- Title: B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
- Authors: Zhuqiang Lu, Zhenfei Yin, Mengwei He, Zhihui Wang, Zicheng Liu, Zhiyong Wang, Kun Hu,
- Abstract summary: Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. We present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task-relevant spatio-temporal cues. B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks.
- Score: 34.83004283826509
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key idea of VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to process visual and textual content simultaneously. However, understanding videos, especially long videos, remains a challenge for VLLMs, as the number of visual tokens grows rapidly when encoding videos, risking exceeding the context window of VLLMs and introducing a heavy computation burden. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue that the former neglects the rich temporal cues in videos and the latter overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task-relevant spatio-temporal cues while restricting the number of visual tokens to within the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated using a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at https://github.com/zhuqiangLu/B-VLLM.
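The abstract describes a four-stage token-budgeting pipeline: text-conditioned frame selection, temporal de-duplication, spatial token sampling, and an optional spatial token merging step. Below is a minimal sketch of that control flow, assuming hypothetical helpers (`select_frames`, `merge_duplicate_frames`, `sample_spatial_tokens`) and simple cosine-similarity scoring; the learned, text-conditioned modules in B-VLLM itself are defined in the paper and the linked repository.

```python
# Minimal sketch of the token-budgeting pipeline described in the abstract.
# All module names and scoring choices are illustrative assumptions, not the
# authors' implementation (see https://github.com/zhuqiangLu/B-VLLM).
import torch
import torch.nn.functional as F


def select_frames(frame_feats, text_feat, k):
    """Text-conditioned adaptive frame selection: keep the k frames whose
    pooled features are most similar to the text query embedding."""
    pooled = frame_feats.mean(dim=1)                          # (T, D) per-frame descriptor
    scores = F.cosine_similarity(pooled, text_feat, dim=-1)   # (T,)
    idx = scores.topk(k).indices.sort().values                # keep temporal order
    return frame_feats[idx]


def merge_duplicate_frames(frame_feats, sim_thresh=0.95):
    """Temporal frame token merging: drop a frame when it is nearly
    identical to the previously kept frame (simple de-duplication)."""
    kept = [frame_feats[0]]
    for f in frame_feats[1:]:
        sim = F.cosine_similarity(f.mean(0), kept[-1].mean(0), dim=-1)
        if sim < sim_thresh:
            kept.append(f)
    return torch.stack(kept)


def sample_spatial_tokens(frame_feats, tokens_per_frame):
    """Spatial token sampling: uniformly subsample patch tokens per frame
    so the total count stays under the context-window budget."""
    T, N, D = frame_feats.shape
    step = max(N // tokens_per_frame, 1)
    return frame_feats[:, ::step][:, :tokens_per_frame]


# Toy usage: 64 frames x 196 patch tokens x 512-dim features, budget of 2048 tokens.
frames = torch.randn(64, 196, 512)
text = torch.randn(512)
budget = 2048

selected = select_frames(frames, text, k=16)
deduped = merge_duplicate_frames(selected)
tokens = sample_spatial_tokens(deduped, tokens_per_frame=budget // deduped.shape[0])
print(tokens.shape[0] * tokens.shape[1], "visual tokens fed to the LLM")
```

The optional spatial token merging step that the abstract mentions for hitting an exact token count is omitted here; the thresholds and similarity scores above stand in for the learned, text-conditioned modules used in B-VLLM.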
Related papers
- QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs).
QuoTA strategically allocates frame-level importance scores based on query relevance.
We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS).
It inserts a plug-and-play keyframe selection module that aims to maximize the useful information captured within a fixed number of video tokens.
Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative frames.
arXiv Detail & Related papers (2025-02-28T17:46:29Z) - [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster [26.025260449905577]
Existing methods assess the importance of visual tokens based on the text-visual cross-attentions in large language models (LLMs). We introduce FasterVLM, a training-free visual token pruning method that evaluates the importance of visual tokens more accurately using the [CLS] attention of the vision encoder. FasterVLM can prune 95% of visual tokens while maintaining 90% of the performance of LLaVA-1.5-7B (a minimal pruning sketch appears after this list).
arXiv Detail & Related papers (2024-12-02T18:57:40Z) - TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z) - SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference [45.11612407862277]
In vision-language models (VLMs), visual tokens usually incur significant computational overhead despite carrying sparser information than text tokens.
We propose SparseVLM, a text-guided, training-free token optimization mechanism that eliminates the need for extra parameters or fine-tuning.
arXiv Detail & Related papers (2024-10-06T09:18:04Z) - TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM).
We propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics and generate condensed visual tokens.
Our approach compresses the visual tokens by 75% to 89%, while achieving comparable or even better performance across diverse benchmarks.
arXiv Detail & Related papers (2024-07-02T16:10:55Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language.
This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok)
SeTok groups visual features into semantic units via a dynamic clustering algorithm.
The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address this challenge in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [66.40252169137447]
We present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding.
LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token.
This dual-token strategy significantly reduces the overload of long videos while preserving critical information.
arXiv Detail & Related papers (2023-11-28T18:53:43Z)
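The final entry above, LLaMA-VID, compresses each frame into just two tokens: a context token and a content token. A minimal sketch of that idea follows, under the assumption that the context token comes from text-conditioned attention over patch features and the content token from average pooling; the paper's exact formulation may differ.

```python
# Illustrative sketch of a two-token-per-frame representation in the spirit of
# LLaMA-VID; the attention and pooling choices are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def frame_to_two_tokens(patch_feats, text_feat):
    """Compress one frame's patch tokens (N, D) into a context token weighted by
    relevance to the text query, plus a content token that keeps a
    query-agnostic summary of the frame."""
    attn = F.softmax(patch_feats @ text_feat / patch_feats.shape[-1] ** 0.5, dim=0)  # (N,)
    context_token = attn @ patch_feats        # (D,) text-conditioned summary
    content_token = patch_feats.mean(dim=0)   # (D,) plain visual summary
    return torch.stack([context_token, content_token])  # (2, D)


# Toy usage: a 1000-frame video collapses to 2000 visual tokens.
video = torch.randn(1000, 196, 512)
text = torch.randn(512)
tokens = torch.cat([frame_to_two_tokens(f, text) for f in video])
print(tokens.shape)  # torch.Size([2000, 512])
```

Compared with B-VLLM's adaptive budgeting across selected frames, such a fixed two-token scheme trades per-frame spatial detail for a constant cost, which is what lets very long videos fit within the context window.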
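The FasterVLM entry earlier in the list ranks visual tokens by the vision encoder's [CLS] attention instead of by text-visual cross-attention inside the LLM. The sketch below illustrates that training-free pruning step; the keep ratio and the source of the attention scores are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of [CLS]-attention-based visual token pruning in the spirit of
# FasterVLM; pruning ratio and attention source are illustrative assumptions.
import torch


def prune_by_cls_attention(visual_tokens, cls_attn, keep_ratio=0.05):
    """Keep only the visual tokens that the vision encoder's [CLS] token
    attends to most strongly, discarding the rest before the LLM sees them."""
    k = max(int(visual_tokens.shape[0] * keep_ratio), 1)
    idx = cls_attn.topk(k).indices.sort().values   # preserve original token order
    return visual_tokens[idx]


# Toy usage: keep 5% of 576 patch tokens (i.e. prune 95%), the ratio the
# summary reports for LLaVA-1.5.
tokens = torch.randn(576, 4096)          # projected patch tokens
cls_attn = torch.rand(576)               # [CLS]-to-patch attention from the vision encoder
print(prune_by_cls_attention(tokens, cls_attn).shape)  # torch.Size([28, 4096])
```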