Related papers: LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit

URL: http://arxiv.org/abs/2508.09981v1
Date: Wed, 13 Aug 2025 17:54:49 GMT
Title: LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Authors: Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang,
Abstract summary: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands.<n>Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy.<n>We introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit.
Score: 29.877232989285833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLM. Our code is available at https://github.com/ModelTC/LightCompress.

Related papers

Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model.<n>ARC is an auto-regressive model that performs compression via next-gressive prediction.<n>MoS module refines the compressed tokens by utilizing multiple compression results.<n>ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning [55.17170420615628]
Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks.<n>We propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process.<n>Our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency.
arXiv Detail & Related papers (2026-01-29T18:07:39Z)
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models [30.433778463779618]
We present UniPruneBench, a benchmark for visual token pruning in multimodal models.<n>UniPruneBench provides standardized protocols across six ability dimensions and ten datasets.
arXiv Detail & Related papers (2025-11-04T15:17:06Z)
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks.<n>Previous token compression techniques are often constrained by rules that risk discarding critical information.<n>We reformulate token compression as a lightweight plug-and-play framework that reformulates token compression into an end-to-end learnable decision process.
arXiv Detail & Related papers (2025-10-18T17:54:18Z)
Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z)
LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models [62.240460476785934]
We propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder.<n>LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts.
arXiv Detail & Related papers (2025-07-03T03:42:54Z)
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs [38.34856927170692]
We propose a training-free framework for analyzing trained Multimodal Large Language Model (MLLM)<n>It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens.<n>Experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs.
arXiv Detail & Related papers (2025-01-31T11:09:16Z)
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models [21.36437021964681]
"Global Compression Commander" is a novel token compression framework for HR-LVLMs.<n>GlobalCom$2$ maintains over 90% performance while compressing 90% visual tokens, reducing FLOPs and peak memory to 9.1% and 60%.
arXiv Detail & Related papers (2025-01-09T11:57:58Z)
Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need [53.584140947828004]
Language large model (LLM) with unprecedented intelligence is a general-purpose lossless compressor for various data modalities. We propose P$2$-LLM, a next-pixel prediction-based LLM, which integrates various elaborated insights and methodologies. Experiments on benchmark datasets demonstrate that P$2$-LLM can beat SOTA classical and learned codecs.
arXiv Detail & Related papers (2024-11-19T12:15:40Z)
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment [36.958867918858296]
Large language models (LLMs) have demonstrated their strong intelligence ability, but high demand for computation and storage hinders their practical application. We present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis for LLM compression algorithms.
arXiv Detail & Related papers (2024-10-28T14:45:01Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window.<n>We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs.<n>Our method achieves minimal performance loss with a compression ratio of 576$times$, resulting in up to 94.8$%$ fewer FLOPs and 69.6$%$ acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models. We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs. It targets effective, efficient, and flexible compression of long contexts. It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity. We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.