LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
- URL: http://arxiv.org/abs/2512.05391v2
- Date: Thu, 11 Dec 2025 16:04:05 GMT
- Title: LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models
- Authors: Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen
- Abstract summary: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders. We introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules.
- Score: 19.89635786218384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.
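Below is a minimal, illustrative sketch of the compression pipeline the abstract describes. The module names (Sparse Token Merger, Token Importance Scorer) follow the paper, but every shape, merge rule, and hyperparameter here is an assumption for exposition; MAE pretraining of the resampler and the Cross-Attention Routing Adapter's integration with the language model are omitted.

```python
# Illustrative sketch of the redundancy-reducing pipeline from the abstract.
# Module names follow the paper; all merge rules, shapes, and hyperparameters
# here are assumptions for exposition, not the authors' released code.
import torch
import torch.nn as nn


class SparseTokenMerger(nn.Module):
    """Remove local redundancy by merging each tile token into its nearest kept token."""
    def __init__(self, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (num_tiles, dim) tile-level features from a frozen patch encoder
        n = tiles.size(0)
        k = max(1, int(n * self.keep_ratio))
        # Keep the k most "distinctive" tiles (largest feature norm here, a
        # stand-in for whatever saliency criterion the paper actually uses).
        keep = tiles.norm(dim=-1).topk(k).indices
        kept = tiles[keep]                           # (k, dim)
        # Merge every tile into its most similar kept token (mean pooling).
        assign = (tiles @ kept.t()).argmax(dim=-1)   # (n,)
        merged = torch.zeros_like(kept)
        merged.index_add_(0, assign, tiles)
        counts = torch.bincount(assign, minlength=k).clamp(min=1).unsqueeze(-1)
        return merged / counts                       # (k, dim)


class Resampler(nn.Module):
    """Compress globally redundant tokens into a fixed-size slide-level set
    via learned queries and cross-attention (MAE pretraining not shown)."""
    def __init__(self, dim: int = 512, num_queries: int = 64, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0)                # (1, Q, dim)
        kv = tokens.unsqueeze(0)                     # (1, k, dim)
        out, _ = self.attn(q, kv, kv)
        return out.squeeze(0)                        # (Q, dim)


class TokenImportanceScorer(nn.Module):
    """Score compressed tokens so only the most task-relevant reach the LLM."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor, top_k: int = 16) -> torch.Tensor:
        idx = self.score(tokens).squeeze(-1).topk(top_k).indices
        return tokens[idx]


if __name__ == "__main__":
    tiles = torch.randn(4096, 512)   # thousands of tile features from one WSI
    slide = Resampler()(SparseTokenMerger()(tiles))
    visual = TokenImportanceScorer()(slide)
    print(visual.shape)              # torch.Size([16, 512]), routed to the LLM via CARA
```

The intended effect is that the language model only ever attends to a few dozen visual tokens per slide instead of thousands of tile features, which is where the computation and memory savings come from.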
Related papers
- PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs [59.78917775399492]
Multimodal instruction fine-tuning paradoxically degrades the base model's text reasoning capability. We propose a training-free framework to mitigate this degradation.
arXiv Detail & Related papers (2026-01-12T15:27:51Z) - From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations [14.0185129202898]
BoxPromptIML is a novel weakly-supervised IML framework that balances annotation cost and localization performance. Inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled patterns with real-time observational cues.
arXiv Detail & Related papers (2025-11-25T14:39:17Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification [18.928408687991368]
Large Language Models (LLMs) are emerging as a promising direction in computational pathology. Existing vision-language Multi-Instance Learning (MIL) methods often employ unidirectional guidance. We introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction.
arXiv Detail & Related papers (2025-11-11T07:46:38Z) - PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity [39.98516860109934]
PixelRefer is a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset.
arXiv Detail & Related papers (2025-10-27T17:59:32Z) - HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models [50.31704374968706]
Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources for training to achieve cross-modal alignment at multiple granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are commonly equipped with, e.g., CLIP and SAM, which lack multi-granularity alignment with language.
arXiv Detail & Related papers (2025-10-23T08:16:44Z) - Sparse Training Scheme for Multimodal LLM [26.81140959413325]
Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. We propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: a visual-token compression module, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by skipping unnecessary layers in the language model during both forward and backward passes (a minimal sketch of the layer-skipping idea follows this list).
arXiv Detail & Related papers (2025-09-16T11:33:20Z) - BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion [6.8723394189831035]
Large language models pose challenges for deployment in resource-constrained environments. We propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding.
arXiv Detail & Related papers (2025-09-10T16:09:49Z) - LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models will be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
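As flagged in the Sparse Training Scheme entry above, the Layer Dynamic Skipper skips unnecessary language-model layers during both forward and backward passes. Below is a minimal, hypothetical sketch of that idea; the stochastic gating rule, block structure, and all hyperparameters are assumptions for illustration, not the paper's actual criterion or implementation.

```python
# Minimal sketch of dynamic layer skipping, the idea behind the "Layer Dynamic
# Skipper" described above. The random gate used here is an assumed stand-in
# for the paper's skipping criterion.
import torch
import torch.nn as nn


class SkippableBlock(nn.Module):
    def __init__(self, dim: int, skip_prob: float = 0.5):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.skip_prob = skip_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # During training, stochastically skip this block: both its forward
        # computation and its backward pass are avoided when skipped.
        if self.training and torch.rand(()) < self.skip_prob:
            return x
        return x + self.ff(self.norm(x))


model = nn.Sequential(*[SkippableBlock(256) for _ in range(12)])
x = torch.randn(8, 256)
loss = model(x).pow(2).mean()
loss.backward()   # gradients flow only through the blocks that actually ran
```

On average, half the blocks are bypassed each step, so both forward FLOPs and the backward graph shrink proportionally, which is the computational saving such a skipper targets.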