PathAlign: A vision-language model for whole slide images in histopathology
- URL: http://arxiv.org/abs/2406.19578v1
- Date: Thu, 27 Jun 2024 23:43:36 GMT
- Title: PathAlign: A vision-language model for whole slide images in histopathology
- Authors: Faruk Ahmed, Andrew Sellergren, Lin Yang, Shawn Xu, Boris Babenko, Abbi Ward, Niels Olson, Arash Mohtashamian, Yossi Matias, Greg S. Corrado, Quang Duong, Dale R. Webster, Shravya Shetty, Daniel Golden, Yun Liu, David F. Steiner, Ellery Wulczyn
- Abstract summary: We develop a vision-language model based on the BLIP-2 framework using WSIs and curated text from pathology reports.
This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest.
We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization.
- Score: 13.567674461880905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
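To make the shared image-text embedding space concrete, below is a minimal sketch of text-to-image retrieval over language-aligned WSI embeddings. The random stand-in embeddings, the 256-dimensional space, and the `normalize` helper are assumptions for illustration, not PathAlign's actual encoders or code.

```python
# Minimal sketch of text-to-image retrieval in a shared embedding space,
# as enabled by language-aligned WSI embeddings. The embeddings here are
# random stand-ins; real ones would come from the trained image/text encoders.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend embeddings: 1,000 WSIs and one text query, both mapped into a
# shared 256-dimensional space by their respective encoders.
wsi_embeddings = normalize(rng.standard_normal((1000, 256)))
query_embedding = normalize(rng.standard_normal(256))

# Cosine similarity reduces to a dot product after L2 normalization.
scores = wsi_embeddings @ query_embedding

# Retrieve the five slides most similar to the text query.
top_k = np.argsort(scores)[::-1][:5]
print("Top-5 matching slide indices:", top_k)
```

In practice the embeddings would come from the trained BLIP-2-style WSI and text encoders; the ranking step itself stays this simple, since cosine similarity is just a dot product after L2 normalization.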
Related papers
- WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering [6.315841446240698]
We propose a novel framework to interpret whole slide images (WSIs) by generative visual question answering.
WSI-VQA shows universality by reframing various kinds of slide-level tasks in a question-answering pattern.
We establish a WSI-VQA dataset which contains 8672 slide-level question-answering pairs with 977 WSIs.
arXiv Detail & Related papers (2024-07-08T04:37:32Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features (see the contrastive-loss sketch after this list).
Our approach compares favorably with existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning [35.24716774767677]
We present PathM3, a multimodal, multi-task, multiple instance learning framework for WSI classification and captioning.
Our method overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data.
arXiv Detail & Related papers (2024-03-13T21:19:12Z) - A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z) - WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curate the largest WSI-text dataset (PathText).
On the model side, we propose the multiple instance generative model (MI-Gen).
arXiv Detail & Related papers (2023-11-27T05:05:41Z) - Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis [9.912061800841267]
Whole Slide Images (WSIs) of histopathology tissue are analyzed at the slide level.
Previous methods generally divide a WSI into a large number of patches and then aggregate all patches within the WSI to make a slide-level prediction.
We propose to amend the position embedding for shape-varying, long-contextual WSIs by introducing a linear bias into attention (see the attention-bias sketch after this list).
arXiv Detail & Related papers (2023-11-21T03:08:47Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of gigapixel whole slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources than benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
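The language-guided segmentation entry above mentions a contrastive learning mechanism guided by text encoder features. Here is a hypothetical sketch of the general mechanism, a symmetric InfoNCE loss between paired image and text features; the function name, shapes, and temperature value are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an image-text contrastive (InfoNCE-style) loss,
# the general mechanism behind text-guided contrastive learning.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature
    # The i-th image is paired with the i-th text; all others are negatives.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage: a batch of 8 paired features in a 128-d shared space.
loss = image_text_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(float(loss))
```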
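The Long-MIL entry above proposes amending position information for shape-varying, long-contextual WSIs by adding a linear bias to attention. The sketch below illustrates that idea with a 1-D token-distance bias (in the spirit of ALiBi); the function name, slope value, and 1-D distance are simplifying assumptions, since the paper targets 2-D patch layouts.

```python
# Minimal sketch of scaled dot-product attention with a linear distance bias.
import torch

def attention_with_linear_bias(q, k, v, slope=0.5):
    """Attention where distant token pairs receive a linearly growing penalty.

    q, k, v: tensors of shape (num_tokens, dim).
    slope: per-head factor scaling the distance penalty.
    """
    n, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    # Linear bias: subtract a penalty proportional to token distance.
    positions = torch.arange(n)
    distance = (positions[:, None] - positions[None, :]).abs().float()
    scores = scores - slope * distance
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy usage with random patch embeddings.
x = torch.randn(16, 64)
out = attention_with_linear_bias(x, x, x)
print(out.shape)  # torch.Size([16, 64])
```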