LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
- URL: http://arxiv.org/abs/2508.09981v1
- Date: Wed, 13 Aug 2025 17:54:49 GMT
- Title: LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
- Authors: Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang,
- Abstract summary: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands.<n>Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy.<n>We introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit.
- Score: 29.877232989285833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.
Related papers
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model.<n>ARC is an auto-regressive model that performs compression via next-gressive prediction.<n>MoS module refines the compressed tokens by utilizing multiple compression results.<n>ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z) - VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks.<n>We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process.<n>Our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z) - Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models [30.433778463779618]
We present UniPruneBench, a benchmark for visual token pruning in multimodal models.<n>UniPruneBench provides standardized protocols across six ability dimensions and ten datasets.
arXiv Detail & Related papers (2025-11-04T15:17:06Z) - VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks.<n>Previous token compression techniques are often constrained by rules that risk discarding critical information.<n>We reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process.
arXiv Detail & Related papers (2025-10-18T17:54:18Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models [62.240460476785934]
We propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder.<n>LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts.
arXiv Detail & Related papers (2025-07-03T03:42:54Z) - RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs [38.34856927170692]
We propose a training-free framework for analyzing trained Multimodal Large Language Model (MLLM)<n>It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens.<n>Experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs.
arXiv Detail & Related papers (2025-01-31T11:09:16Z) - Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models [21.36437021964681]
"Global Compression Commander" is a novel token compression framework for HR-LVLMs.<n>GlobalCom$2$ maintains over 90% performance while compressing 90% visual tokens, reducing FLOPs and peak memory to 9.1% and 60%.
arXiv Detail & Related papers (2025-01-09T11:57:58Z) - Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
Language large model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities.
We propose P$2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies.
Experiments on benchmark datasets demonstrate that P$2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z) - LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated their strong intelligence ability, but high demand for computation and storage hinders their practical application.
We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window.<n>We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs.<n>Our method achieves minimal performance loss with a compression ratio of 576$times$, resulting in up to 94.8$%$ fewer FLOPs and 69.6$%$ acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z) - LLMLingua: Compressing Prompts for Accelerated Inference of Large
Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.