On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
- URL: http://arxiv.org/abs/2601.21531v1
- Date: Thu, 29 Jan 2026 10:47:21 GMT
- Title: On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
- Authors: Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu
- Abstract summary: We show that existing encoder-based attacks can substantially overestimate the robustness of compressed large vision-language models (LVLMs). We propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget.
- Score: 22.436953683970007
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
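The two objectives can be illustrated with a minimal sketch (the function name, the scoring choices, and the loss weighting below are illustrative assumptions, not the paper's implementation): survival probabilities are averaged over a set of plausible token budgets, per-token feature distortion is weighted by survival, and a correlation term aligns distortion with importance scores so that heavily distorted tokens tend to be retained by the compressor.

```python
import numpy as np

def cage_style_loss(feats_clean, feats_adv, scores, budgets):
    """Hypothetical sketch of a compression-aligned attack objective.

    feats_clean, feats_adv: (N, D) visual token features before/after perturbation.
    scores: (N,) token importance scores (e.g., attention- or norm-based).
    budgets: iterable of plausible token budgets k (the real budget is unknown).
    """
    n = feats_clean.shape[0]
    order = np.argsort(-scores)  # token indices, most important first
    # Survival probability of each token, averaged over plausible budgets.
    survival = np.zeros(n)
    for k in budgets:
        survival[order[:k]] += 1.0 / len(budgets)
    # Per-token feature distortion induced by the perturbation.
    distortion = np.linalg.norm(feats_adv - feats_clean, axis=1)
    # (i) Expected feature disruption: concentrate distortion on likely survivors.
    disruption = float(np.sum(survival * distortion))
    # (ii) Rank distortion alignment: correlate distortion with importance scores
    # so highly distorted tokens are ranked high and thus retained.
    d, s = distortion - distortion.mean(), scores - scores.mean()
    alignment = float(np.dot(d, s) / (np.linalg.norm(d) * np.linalg.norm(s) + 1e-8))
    return disruption + alignment  # maximized w.r.t. the input perturbation
```

In an actual attack loop, this objective would be maximized over an imperceptible image perturbation by backpropagating through the vision encoder.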
Related papers
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics [22.98826013817833]
We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing. We find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy. We identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival.
arXiv Detail & Related papers (2026-03-02T04:16:36Z) - Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARFC is an auto-regressive model that performs compression via next token prediction. The MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z) - MTC-VAE: Multi-Level Temporal Compression with Content Awareness [54.85288415164888]
Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. We present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression.
arXiv Detail & Related papers (2026-02-01T17:08:02Z) - CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting [57.73006852239138]
We present the first unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS). Inspired by recent advances in rate-distortion-optimized 3DGS compression, this work integrates semantic learning into the compression pipeline to support decoder-side applications. Our scheme features a lightweight implicit neural representation-based hyperprior, enabling efficient entropy coding of both color and semantic attributes.
arXiv Detail & Related papers (2026-01-19T08:21:45Z) - Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models [69.84867664371826]
We show that visual token compression substantially degrades the robustness of Large Vision-Language Models (LVLMs). Small and imperceptible perturbations can significantly alter token importance ranking, leading the compression mechanism to mistakenly discard task-critical information. We propose a Compression-Aware Attack to systematically study and exploit this vulnerability.
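The failure mode can be shown with a toy sketch (the pruning rule and the numbers below are hypothetical, not taken from the paper): a generic top-k compressor keeps the k highest-scoring tokens, so a borderline but task-critical token can be evicted by an imperceptibly small shift in the scores.

```python
import numpy as np

def topk_prune(scores, k):
    """Generic score-based pruning compressor (illustrative, not the paper's):
    return the sorted indices of the k highest-scoring visual tokens."""
    return np.sort(np.argsort(-scores)[:k])

# Token 2 is task-critical but its importance score barely exceeds token 3's.
scores = np.array([0.9, 0.8, 0.51, 0.50])
perturbed = scores + np.array([0.0, 0.0, -0.02, 0.02])  # tiny score shift

print(topk_prune(scores, k=3))     # token 2 is kept
print(topk_prune(perturbed, k=3))  # token 2 is discarded, token 3 kept instead
```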
arXiv Detail & Related papers (2026-01-17T13:02:41Z) - Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models [19.536595270049016]
We propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks.
arXiv Detail & Related papers (2025-12-20T20:24:07Z) - UniComp: Rethinking Video Compression Through Informational Uniqueness [16.98296446798904]
UniComp aims to maximize the information fidelity of video representations under constrained computational budgets. We introduce the notion of information uniqueness to measure intrinsic redundancy among tokens and link it to reconstruction error.
arXiv Detail & Related papers (2025-12-03T08:56:23Z) - CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs [29.08277140543501]
We introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens. Experiments show that CORE not only establishes a new state-of-the-art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings.
arXiv Detail & Related papers (2025-11-18T03:02:23Z) - VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs [82.72388893596555]
Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks. Previous token compression techniques are often constrained by hand-crafted rules that risk discarding critical information. We propose a lightweight plug-and-play framework that reformulates token compression as an end-to-end learnable decision process.
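A learnable decision process of this kind might be sketched as follows (the function name and scoring setup are illustrative assumptions, not VisionSelector's actual design): a small scorer emits one keep-logit per visual token, and a top-k mask over those logits replaces hand-crafted pruning rules.

```python
import numpy as np

def learnable_token_select(feats, logits, k):
    """Hypothetical sketch of learnable token selection: keep the k tokens
    with the highest learned keep-logits, preserving their original order.
    In real training, a straight-through or Gumbel top-k estimator would
    let gradients flow back into the scorer that produces the logits."""
    keep = np.zeros(len(logits), dtype=bool)
    keep[np.argsort(-logits)[:k]] = True
    return feats[keep]  # compressed token sequence
```

The appeal over rule-based pruning is that the logits are optimized jointly with the task loss, so the selection criterion adapts to what the language model actually needs.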
arXiv Detail & Related papers (2025-10-18T17:54:18Z) - Contextual Compression Encoding for Large Language Models: A Novel Framework for Multi-Layered Parameter Space Pruning [0.0]
Contextual Compression Encoding (CCE) introduces a multi-stage encoding mechanism that dynamically restructures parameter distributions. CCE retains linguistic expressivity and coherence, maintaining accuracy across a range of text generation and classification tasks.
arXiv Detail & Related papers (2025-02-12T11:44:19Z) - UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective [85.08718140718707]
UNComp is an uncertainty-aware framework that uncovers sparsity patterns that can be used for adaptive compression. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x.
arXiv Detail & Related papers (2024-10-04T02:32:36Z) - Once-for-All: Controllable Generative Image Compression with Dynamic Granularity Adaptation [52.82508784748278]
This paper proposes a Controllable Generative Image Compression framework, termed Control-GIC. Control-GIC is capable of fine-grained adaptation across a broad spectrum of compression ratios while ensuring high fidelity and generality. Our experiments show that Control-GIC allows highly flexible and controllable adaptation, and the results demonstrate its superior performance over recent state-of-the-art methods.
arXiv Detail & Related papers (2024-06-02T14:22:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.