Related papers: VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models

URL: http://arxiv.org/abs/2503.16980v6
Date: Mon, 29 Sep 2025 01:09:31 GMT
Title: VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models
Authors: Haichao Zhang, Yun Fu
Abstract summary: We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
Score: 35.38573641029626
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction techniques, such as pruning and merging, often disrupt essential positional embeddings and rely on continuous visual tokens sampled from nearby pixels with similar spatial-temporal locations. By removing only a small fraction of tokens, these methods still produce relatively lengthy continuous sequences, which falls short of the extreme compression required to balance computational efficiency and token count in video LLMs. In this paper, we introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. We propose VQToken, a neural discrete token representation framework that (i) applies adaptive vector quantization to continuous ViT embeddings to learn a compact codebook and (ii) preserves spatial-temporal positions via a token hash function by assigning each grid-level token to its nearest codebook entry. On the Extreme Short Token Reduction task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark. It also achieves comparable performance on ActNet-QA, Long Video Bench, and VideoMME. We further introduce the Token Information Density (TokDense) metric and formalize fixed-length and adaptive-length subtasks, achieving state-of-the-art results in both settings. Our approach dramatically lowers theoretical complexity, increases information density, drastically reduces token counts, and enables efficient video LLMs in resource-constrained environments.
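The two steps named in the abstract (learning a compact codebook from continuous embeddings, then mapping each grid-level token to its nearest codebook entry) can be illustrated with a small sketch. This is not the authors' implementation: plain k-means stands in for the paper's adaptive vector quantization, a position-to-index dictionary stands in for the token hash function, and all function names, sizes, and the random toy embeddings are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def kmeans_codebook(tokens, k, iters=10, seed=0):
    """Plain k-means; an assumed stand-in for adaptive vector quantization."""
    rng = random.Random(seed)
    codebook = [list(t) for t in rng.sample(tokens, k)]
    for _ in range(iters):
        assign = [nearest(t, codebook) for t in tokens]
        for j in range(k):
            members = [t for t, a in zip(tokens, assign) if a == j]
            if members:  # keep the previous centroid if a cluster is empty
                dim = len(members[0])
                codebook[j] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def quantize_video(grid_tokens, k):
    """grid_tokens maps (frame, row, col) -> embedding (list of floats).

    Returns the codebook plus a position -> codebook-index map, which plays
    the role of the position-preserving lookup in this sketch.
    """
    positions = list(grid_tokens)
    tokens = [grid_tokens[p] for p in positions]
    codebook = kmeans_codebook(tokens, k)
    index_map = {p: nearest(grid_tokens[p], codebook) for p in positions}
    return codebook, index_map


# Toy input: 4 frames of a 6x6 grid with 8-dimensional random embeddings.
rng = random.Random(1)
T, H, W, D, K = 4, 6, 6, 8, 3
grid = {(t, h, w): [rng.gauss(0, 1) for _ in range(D)]
        for t in range(T) for h in range(H) for w in range(W)}

codebook, index_map = quantize_video(grid, K)
ratio = len(codebook) / len(grid)
print(f"{len(grid)} grid tokens -> {len(codebook)} discrete tokens "
      f"({ratio:.2%} of the original length)")
```

In this toy setting the language model would receive only the K codebook vectors, while the index map records which grid position each one covers. The compression ratio printed here depends entirely on the assumed toy sizes and is unrelated to the 0.07 percent figure reported in the abstract.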

Related papers

TrajTok: Learning Trajectory Tokens enables better Video Understanding [63.1260672430712]
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
arXiv Detail & Related papers (2026-02-26T09:15:34Z)
VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents [33.80068883432077]
This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. We propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum.
arXiv Detail & Related papers (2026-02-04T04:39:46Z)
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models [24.875526594002434]
We present QTSplus, a visual token selection module for long video understanding scenarios. It is integrated into Qwen2.5-VL and compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. Results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios.
arXiv Detail & Related papers (2025-11-14T22:41:27Z)
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function. Our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens. Our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution.
arXiv Detail & Related papers (2025-10-31T17:29:39Z)
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z)
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs [23.801172170798132]
LLaVA-Scissor is a training-free token compression strategy designed for multimodal large language models. We propose to leverage the Semantic Connected Components (SCC) approach to ensure comprehensive semantic coverage. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks.
arXiv Detail & Related papers (2025-06-27T02:29:58Z)
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms [16.41418610688371]
We introduce CrossLMM, which substantially reduces visual token quantity with minimal performance degradation. We also introduce a text-to-visual cross-attention mechanism, for which the text tokens are enhanced through interaction with the original visual tokens. Our approach achieves comparable or superior performance across diverse video-based Large Language Models benchmarks.
arXiv Detail & Related papers (2025-05-22T17:59:53Z)
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model [45.01871133425388]
We propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token from the whole lifecycle. MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy.
arXiv Detail & Related papers (2024-11-16T13:45:33Z)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency. Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens when needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)
AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. We propose to apply adaptive resolution for different regions in the image according to their importance. We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.