Related papers: WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion

WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion

URL: http://arxiv.org/abs/2508.14537v1
Date: Wed, 20 Aug 2025 08:41:19 GMT
Title: WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion
Authors: Yonghan Shin, SeungKyu Kim, Won-Ki Jeong,
Abstract summary: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale.<n>We propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models.<n>We show that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing.
Score: 3.677055050765245
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks-making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.

Related papers

AtlasPatch: An Efficient and Scalable Tool for Whole Slide Image Preprocessing in Computational Pathology [18.45768749525754]
We present AtlasPatch, a slide preprocessing framework for accurate tissue detection and high-specified patch extraction.<n>The tool extrapolates tissue masks from thumbnails to full-resolution slides, with options to stream patches directly into common image encoders for embedding or store patch images.<n>We assess AtlasPatch across segmentation precision, computational complexity, and downstream multiple-instance learning, matching state-of-the-art performance while operating at a fraction of their computational cost.
arXiv Detail & Related papers (2026-02-03T20:32:07Z)
EvoPS: Evolutionary Patch Selection for Whole Slide Image Analysis in Computational Pathology [0.0]
We propose EvoPS, a novel framework that formulates patch selection as a multi-objective optimization problem.<n>We validated our framework across four major cancer cohorts from The Cancer Genome Atlas.
arXiv Detail & Related papers (2025-11-10T19:07:44Z)
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations.<n>We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy.<n>We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z)
Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection [50.343419243749054]
Anomaly detection is critical in fields such as medical diagnostics and industrial defect detection.<n> CLIP's coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies.<n>Crane improves the state-of-the-art ZSAD from 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed.
arXiv Detail & Related papers (2025-04-15T10:42:25Z)
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency.<n>Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z)
From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis [81.19923502845441]
We develop a graph-based framework that constructs WSI graph representations.<n>We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches.<n>In our method's final step, we solve the diagnostic task through a graph attention network.
arXiv Detail & Related papers (2025-03-14T20:15:04Z)
Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning [51.525891360380285]
HDMIL is a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches.<n> HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN)
arXiv Detail & Related papers (2025-02-28T15:10:07Z)
Efficient Whole Slide Image Classification through Fisher Vector Representation [2.4472081831862655]
This study introduces a novel method for WSI classification by automating the identification and examination of the most informative patches. Our method involves two-stages: firstly, it extracts only a few patches from the WSIs based on their pathological significance; and secondly, it employs Fisher vectors for representing features extracted from these patches. This approach not only accentuates key pathological features within the WSI representation but also significantly reduces computational overhead, thus making the process more efficient and scalable.
arXiv Detail & Related papers (2024-11-13T11:24:12Z)
PathAlign: A vision-language model for whole slide images in histopathology [13.567674461880905]
We develop a vision-language model based on the BLIP-2 framework using WSIs and curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization.
arXiv Detail & Related papers (2024-06-27T23:43:36Z)
SPLICE -- Streamlining Digital Pathology Image Processing [0.7852714805965528]
We propose an unsupervised patching algorithm, Sequential Patching Lattice for Image Classification and Enquiry (SPLICE) SPLICE condenses a histopathology WSI into a compact set of representative patches, forming a "collage" of WSI while minimizing redundancy. As an unsupervised method, SPLICE effectively reduces storage requirements for representing tissue images by 50%.
arXiv Detail & Related papers (2024-04-26T21:30:36Z)
PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning [35.24716774767677]
We present PathM3, a multi-task, multiple instance learning framework for WSI classification and captioning. Our method overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data.
arXiv Detail & Related papers (2024-03-13T21:19:12Z)
Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representation of giga-pixel level whole slide pathology images (WSI) for downstream tasks is critical. This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes. Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field. We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network. An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
Weakly-supervised Learning For Catheter Segmentation in 3D Frustum Ultrasound [74.22397862400177]
We propose a novel Frustum ultrasound based catheter segmentation method. The proposed method achieved the state-of-the-art performance with an efficiency of 0.25 second per volume.
arXiv Detail & Related papers (2020-10-19T13:56:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.