Efficient Whole Slide Pathology VQA via Token Compression
- URL: http://arxiv.org/abs/2507.14497v1
- Date: Sat, 19 Jul 2025 06:04:25 GMT
- Title: Efficient Whole Slide Pathology VQA via Token Compression
- Authors: Weimin Lyu, Qingqiao Hu, Kehan Qi, Zhan Shi, Wentao Huang, Saumya Gupta, Chao Chen,
- Abstract summary: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs). We propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression.
- Score: 10.122347041204629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to their long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
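The abstract's compression mechanism, trainable tokens that cross-attend over visual and textual features with only the compressed set forwarded to the LLM, lends itself to a short sketch. This is a minimal illustration, not the paper's exact design: the hidden size, the token budget, and the single cross-attention layer are all assumptions.

```python
# Minimal sketch of a [CLS]-style modality compression module (assumed design;
# names, dimensions, and the single cross-attention layer are illustrative).
import torch
import torch.nn as nn

class ModalityCompressor(nn.Module):
    def __init__(self, dim: int = 1024, num_compress_tokens: int = 64, num_heads: int = 8):
        super().__init__()
        # Trainable compression tokens that aggregate information, like BERT's [CLS].
        self.compress_tokens = nn.Parameter(torch.randn(num_compress_tokens, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N_v, dim) patch features; text_tokens: (B, N_t, dim) question features.
        B = visual_tokens.size(0)
        queries = self.compress_tokens.unsqueeze(0).expand(B, -1, -1)
        context = torch.cat([visual_tokens, text_tokens], dim=1)  # (B, N_v + N_t, dim)
        compressed, _ = self.cross_attn(queries, context, context)
        return self.norm(compressed)  # (B, num_compress_tokens, dim), fed to the LLM

# Usage: thousands of patch tokens collapse to a fixed, small budget.
comp = ModalityCompressor()
vis = torch.randn(1, 4096, 1024)   # e.g., 4096 WSI patch tokens
txt = torch.randn(1, 32, 1024)     # question tokens
out = comp(vis, txt)               # torch.Size([1, 64, 1024])
```

Whatever the paper's exact wiring, the key property is visible here: the LLM's input length is fixed by the number of compression tokens, not by the WSI size.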
Related papers
- QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining [28.2730962800806]
We propose a drop-in replacement for CLIP vision encoders that can be seamlessly integrated with existing MLLMs. QLIP improves the general visual question answering accuracy of the LLaVA v1.5 model series across various model sizes. Notably, QLIP boosts detailed understanding performance on the challenging $V^*$ benchmark by up to 13.6 percent.
arXiv Detail & Related papers (2025-05-29T02:26:34Z)
- VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits [9.470124763460904]
VQ-Logits is a novel approach that drastically reduces the parameter count and computational load of the output layer of large language models. We show that VQ-Logits can achieve up to 99% parameter reduction in the output layer and a 6x speedup in logit computation, with only a marginal 4% increase in perplexity compared to full softmax baselines.
arXiv Detail & Related papers (2025-05-15T11:58:04Z)
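The entry above reduces to a compact idea: replace the V x d unembedding matrix with a small K x d codebook plus a vocabulary-to-code assignment table. The sketch below is a plausible reading under assumed sizes; vocabulary entries sharing a code receive identical logits, which is presumably where the reported perplexity cost comes from.

```python
# Sketch of vector-quantized output logits (assumed formulation): the V x d
# unembedding matrix is replaced by a K x d codebook (K << V) plus a static
# map assigning each vocabulary id to one codebook entry.
import torch
import torch.nn as nn

class VQLogitsHead(nn.Module):
    def __init__(self, dim: int = 768, vocab_size: int = 50_000, codebook_size: int = 512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim) * 0.02)
        # Fixed assignment vocab_id -> code_id (random here; clustered/learned in practice).
        self.register_buffer("assign", torch.randint(0, codebook_size, (vocab_size,)))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, dim). Compute only K logits, then gather out to the full vocab.
        code_logits = hidden @ self.codebook.t()  # (B, T, K), the cheap part
        return code_logits[..., self.assign]      # (B, T, V), one lookup per vocab id

head = VQLogitsHead()
logits = head(torch.randn(2, 16, 768))  # torch.Size([2, 16, 50000]); params scale with K, not V
```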
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation [80.90309237362526]
TokLIP is a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
arXiv Detail & Related papers (2025-05-08T17:12:19Z)
- QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA [16.494799458292]
Images often contain more redundant information than text, and not all visual details are pertinent to a specific question. We propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space.
arXiv Detail & Related papers (2025-04-01T11:07:19Z)
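A question-guided compressor of this kind can be sketched in a few lines: project the question embedding into the vision feature space, score patch tokens by cosine relevance, and keep the top-k. The projection layer, the pooled question feature, and the hard top-k selection are assumptions, not the paper's exact pipeline.

```python
# Sketch of question-guided visual token compression (assumed design): a
# feed-forward layer maps the question embedding into the vision feature
# space, and patch tokens are ranked by cosine similarity to that query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedSelector(nn.Module):
    def __init__(self, text_dim: int = 512, vis_dim: int = 1024, keep: int = 128):
        super().__init__()
        self.proj = nn.Linear(text_dim, vis_dim)  # learnable bridge: text -> vision space
        self.keep = keep

    def forward(self, question_emb: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # question_emb: (B, text_dim) pooled question feature; patch_tokens: (B, N, vis_dim).
        q = F.normalize(self.proj(question_emb), dim=-1)   # (B, vis_dim)
        p = F.normalize(patch_tokens, dim=-1)              # (B, N, vis_dim)
        scores = torch.einsum("bd,bnd->bn", q, p)          # cosine relevance per patch
        idx = scores.topk(self.keep, dim=-1).indices       # (B, keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        return patch_tokens.gather(1, idx)                 # (B, keep, vis_dim)

sel = QuestionGuidedSelector()
kept = sel(torch.randn(2, 512), torch.randn(2, 576, 1024))  # torch.Size([2, 128, 1024])
```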
- Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models [50.214593234229255]
We introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. On this task, our VQToken compresses sequences to just 0.07 percent of their original length while incurring only a 0.66 percent drop in accuracy on the NextQA-MC benchmark.
arXiv Detail & Related papers (2025-03-21T09:46:31Z)
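One plausible mental model for the entry above, sketched under assumed details: snap each video token to its nearest codebook centroid and merge duplicate codes, so sequence length collapses to at most the codebook size. The real method presumably learns the quantizer end-to-end; the fixed codebook and the dedup step here are illustrative only.

```python
# Sketch of extreme token reduction via vector quantization (assumed design):
# continuous video tokens are snapped to their nearest codebook entry and
# duplicate codes are merged, so a long sequence shrinks to a few discrete tokens.
import torch

def vq_reduce(tokens: torch.Tensor, codebook: torch.Tensor):
    # tokens: (N, d) flattened video tokens; codebook: (K, d) learned centroids.
    dists = torch.cdist(tokens, codebook)        # (N, K) pairwise L2 distances
    codes = dists.argmin(dim=-1)                 # nearest-centroid assignment
    unique_codes = torch.unique(codes)           # merge repeats: the "extreme" step
    return codebook[unique_codes], unique_codes  # compact embeddings + discrete ids

tokens = torch.randn(8 * 576, 256)          # e.g., 8 frames x 576 patches
codebook = torch.randn(64, 256)             # tiny codebook (assumed size)
compact, ids = vq_reduce(tokens, codebook)  # at most 64 tokens survive
```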
- SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm in which MLLMs mimic human annotators using interactive segmentation tools. HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to the broader adoption of MLLMs, so compressing vision tokens has emerged as a promising approach to reducing inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
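The grouping idea admits a simple greedy sketch: merge each patch embedding into its most similar existing group when the cosine similarity clears a threshold, then mean-pool each group into a single token. The threshold, the greedy order, and the post-hoc placement are assumptions; the paper's mechanism likely operates inside the encoder rather than as a separate pass.

```python
# Sketch of similarity-based visual token grouping (assumed procedure): patch
# embeddings from a pre-trained encoder are greedily merged into their most
# similar existing group, then each group is mean-pooled into one token.
import torch
import torch.nn.functional as F

def group_tokens(patches: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    # patches: (N, d) patch embeddings for one image.
    normed = F.normalize(patches, dim=-1)
    groups: list[list[int]] = []
    centroids: list[torch.Tensor] = []
    for i in range(patches.size(0)):
        if centroids:
            sims = torch.stack(centroids) @ normed[i]  # cosine sim to each group
            best = int(sims.argmax())
            if sims[best] >= sim_threshold:            # join an existing group
                groups[best].append(i)
                centroids[best] = F.normalize(normed[groups[best]].mean(dim=0), dim=-1)
                continue
        groups.append([i])                             # otherwise start a new group
        centroids.append(normed[i])
    return torch.stack([patches[g].mean(dim=0) for g in groups])

pooled = group_tokens(torch.randn(576, 1024))  # (num_groups, 1024), num_groups <= 576
```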
- PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs. Our model maintains excellent generation quality even when the number of visual tokens is halved.
arXiv Detail & Related papers (2024-10-30T15:05:17Z)
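Pre-integration can be pictured as conditioning the vision encoder on the prompt before any token pruning happens. In the hypothetical sketch below, a pooled prompt embedding is mapped into patch-embedding space and prepended as an extra encoder input token; the MLP bridge and the prepend strategy are assumptions rather than PIP-MM's actual wiring.

```python
# Sketch of pre-integrating prompt information into visual encoding (assumed
# design): a frozen text encoder summarizes the prompt, an MLP maps it into
# patch-embedding space, and the result is prepended as an extra input token
# to the vision encoder, conditioning visual features on the question.
import torch
import torch.nn as nn

class PromptInjector(nn.Module):
    def __init__(self, text_dim: int = 768, vis_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, vis_dim), nn.GELU(), nn.Linear(vis_dim, vis_dim))

    def forward(self, prompt_emb: torch.Tensor, patch_embs: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (B, text_dim); patch_embs: (B, N, vis_dim) pre-encoder embeddings.
        prompt_token = self.mlp(prompt_emb).unsqueeze(1)     # (B, 1, vis_dim)
        return torch.cat([prompt_token, patch_embs], dim=1)  # fed to the vision encoder

inj = PromptInjector()
x = inj(torch.randn(2, 768), torch.randn(2, 576, 1024))  # torch.Size([2, 577, 1024])
```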
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their differing informativeness and leads to a significant increase in the number of image tokens. We propose token-level correlation-guided compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models [31.398537194299752]
Vision-language models (VLMs) have achieved remarkable success on various multi-modal tasks, but they are often bottlenecked by the limited context window. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. Our method achieves minimal performance loss at a compression ratio of 576x, resulting in up to 94.8% fewer FLOPs and a 69.6% acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z)
- Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference [59.91176945361035]
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that boosts MLLMs for rapid inference. VTW strategically withdraws vision tokens at a certain layer, enabling only text tokens to engage in subsequent layers. Our approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
arXiv Detail & Related papers (2024-05-09T14:38:53Z)
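The withdrawal step amounts to slicing vision positions out of the sequence at a chosen depth. The sketch below uses toy blocks in place of real transformer layers and omits the paper's heuristic for choosing the withdrawal layer; both simplifications are assumptions for illustration.

```python
# Sketch of visual tokens withdrawal (assumed mechanics): run a decoder
# normally up to a chosen layer, then drop the vision-token positions so
# that deeper layers attend over text tokens only.
import torch
import torch.nn as nn

def forward_with_vtw(layers: nn.ModuleList, hidden: torch.Tensor,
                     num_vision_tokens: int, withdraw_at: int) -> torch.Tensor:
    # hidden: (B, N_v + N_t, d) with vision tokens first; withdraw_at: layer index.
    for i, layer in enumerate(layers):
        if i == withdraw_at:
            hidden = hidden[:, num_vision_tokens:, :]  # withdraw vision tokens here
        hidden = layer(hidden)                         # deeper layers see text only
    return hidden

# Toy stand-ins for transformer blocks (each maps (B, N, d) -> (B, N, d)).
blocks = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)])
out = forward_with_vtw(blocks, torch.randn(1, 576 + 32, 64),
                       num_vision_tokens=576, withdraw_at=4)  # (1, 32, 64)
```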
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in many applications thanks to their astonishing capabilities. This paper presents LLMLingua, a coarse-to-fine prompt compression method that uses a budget controller to maintain semantic integrity. We show that the proposed approach yields state-of-the-art performance and allows up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
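A heavily simplified sketch of the intuition behind perplexity-style prompt compression: tokens that a small causal language model predicts with high confidence carry little information and can be dropped first, subject to a token budget. LLMLingua itself is more elaborate (budget controller, coarse-to-fine segment compression, iterative token-level steps), so treat this as a toy illustration with hypothetical surprisal values.

```python
# Toy sketch of surprisal-based prompt compression (simplified; not LLMLingua's
# actual algorithm): keep the least predictable tokens until a budget is met.
import math

def compress(tokens: list[str], neg_log_probs: list[float], keep_ratio: float) -> list[str]:
    # neg_log_probs[i]: -log p(tokens[i] | tokens[:i]) from a small causal LM.
    budget = max(1, math.ceil(len(tokens) * keep_ratio))
    # Keep the most informative tokens, then restore their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: -neg_log_probs[i])[:budget]
    return [tokens[i] for i in sorted(ranked)]

toks = ["Please", "kindly", "summarize", "the", "attached", "pathology", "report"]
nll = [0.2, 0.1, 3.1, 0.3, 1.0, 4.2, 2.5]  # hypothetical per-token surprisal values
print(compress(toks, nll, keep_ratio=0.5))  # ['summarize', 'attached', 'pathology', 'report']
```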