Navigating Gigapixel Pathology Images with Large Multimodal Models
- URL: http://arxiv.org/abs/2511.19652v1
- Date: Mon, 24 Nov 2025 19:33:56 GMT
- Title: Navigating Gigapixel Pathology Images with Large Multimodal Models
- Authors: Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai
- Abstract summary: General-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation. We introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images like a pathologist. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines.
- Score: 0.649324006529432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.
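The abstract describes GIANT only at a high level: the LMM starts from a slide overview and iteratively chooses regions to inspect at higher magnification, like a pathologist at a microscope. As a rough illustration of that loop (not the authors' implementation), here is a minimal sketch assuming OpenSlide for pyramid access; the `ask_lmm()` helper, the ZOOM/ANSWER action format, the view size, and the step budget are all invented for illustration.

```python
# Minimal sketch of an iterative WSI-navigation loop in the spirit of GIANT.
# Assumptions (not from the paper): OpenSlide for slide access, a text-based
# ZOOM/ANSWER action protocol, and the hypothetical ask_lmm() client below.
from openslide import OpenSlide  # pip install openslide-python

VIEW = (1024, 1024)   # size of each view shown to the model (assumed)
MAX_STEPS = 10        # navigation budget before giving up (assumed)

def ask_lmm(image, prompt: str) -> str:
    """Hypothetical stand-in for any multimodal chat API: one image plus a
    text prompt in, the model's text reply out."""
    raise NotImplementedError("wire this to your LMM of choice")

def navigate(slide_path: str, question: str) -> str:
    slide = OpenSlide(slide_path)
    # Start from a low-resolution overview, as a pathologist would.
    view = slide.get_thumbnail(VIEW)
    for _ in range(MAX_STEPS):
        reply = ask_lmm(
            view,
            f"Question: {question}\n"
            "Reply 'ZOOM x y level' (x, y in level-0 coordinates, level is a "
            "pyramid level) to inspect a region, or 'ANSWER <text>' when done.",
        )
        if reply.startswith("ANSWER"):
            return reply.removeprefix("ANSWER").strip()
        if reply.startswith("ZOOM"):
            _, x, y, level = reply.split()
            level = min(int(level), slide.level_count - 1)
            # read_region takes the crop's top-left corner in level-0 coords.
            view = slide.read_region((int(x), int(y)), level, VIEW).convert("RGB")
    return "no answer within the step budget"
```

The point the sketch tries to capture is that the model only ever sees a bounded crop, so a gigapixel slide is explored through a handful of targeted reads instead of being downsampled once into a single lossy thumbnail.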
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation.
MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine [12.333678882957377]
We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities.
MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images.
While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%.
arXiv Detail & Related papers (2025-08-04T23:19:18Z) - Evidence-based diagnostic reasoning with multi-agent copilot for human pathology [7.976907866539546]
Current multimodal large language models (MLLMs) in computational pathology face limitations.
We introduce PathChat+, a new MLLM specifically designed for human pathology, trained on over 1 million diverse, pathology-specific instruction samples.
We also present SlideSeek, a reasoning-enabled multi-agent AI system leveraging PathChat+ to autonomously evaluate gigapixel whole-slide images.
arXiv Detail & Related papers (2025-06-26T03:02:16Z) - MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning [24.9872402922819]
Existing medical VQA benchmarks mostly focus on single-image analysis.
We introduce MedFrameQA, the first benchmark that explicitly evaluates multi-image reasoning in medical VQA.
arXiv Detail & Related papers (2025-05-22T17:46:11Z) - PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation [18.734721574528702]
We demonstrate the ability to generate diagnoses from up to 40,000 768x768 pixel image patches from multiple whole-slide images at 10X magnification.
Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting.
arXiv Detail & Related papers (2025-02-14T20:09:13Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multi-perceptual granularity.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z) - PLUTO: Pathology-Universal Transformer [4.920983796208486]
We propose PathoLogy Universal TransfOrmer (PLUTO): a lightweight pathology foundation model (FM) that is pre-trained on a diverse dataset of 195 million image tiles.
We design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales.
We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models.
arXiv Detail & Related papers (2024-05-13T16:40:17Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - Large-scale Long-tailed Disease Diagnosis on Radiology Images [51.453990034460304]
RadDiag is a foundation model supporting 2D and 3D inputs across various modalities and anatomies.
Our dataset, RP3D-DiagDS, contains 40,936 cases with 195,010 scans covering 5,568 disorders.
arXiv Detail & Related papers (2023-12-26T18:20:48Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation [75.58395328700821]
We propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan.
MGP-VAE leverages a Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to exploit correlations across subjects/patients and sub-modalities.
We show the applicability of MGP-VAE to brain tumor segmentation, where one, two, or three of the four sub-modalities may be missing.
arXiv Detail & Related papers (2021-07-07T19:06:34Z) - Universal Model for Multi-Domain Medical Image Retrieval [88.67940265012638]
Medical Image Retrieval (MIR) helps doctors quickly find similar patients' data.
MIR is becoming increasingly useful as digital imaging modalities see wide adoption.
However, the variety of digital imaging modalities in hospitals also poses several challenges for MIR.
arXiv Detail & Related papers (2020-07-14T23:22:04Z)