Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
- URL: http://arxiv.org/abs/2312.04746v2
- Date: Tue, 9 Apr 2024 21:48:42 GMT
- Title: Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
- Authors: Mehmet Saygin Seyfioglu, Wisdom O. Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, Linda Shapiro
- Abstract summary: We introduce Quilt-Instruct, a large-scale dataset of histopathology-specific instruction question/answer pairs.
Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch.
- Score: 11.913023311613884
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diagnosis in histopathology requires global analysis of whole slide images (WSIs), requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction tuning datasets, which currently contain information for individual image patches, without a spatial grounding of the concepts within each patch and without a wider view of the WSI. Therefore, they lack sufficient diagnostic capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a large-scale dataset of 107,131 histopathology-specific instruction question/answer pairs, grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube, which provide spatial localization of narrations via automatically extracted narrator cursor positions. Quilt-Instruct supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using Quilt-Instruct, we train Quilt-LLaVA, which can reason beyond the given single image patch, enabling diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answer pairs. We also thoroughly evaluate Quilt-LLaVA on public histopathology datasets, where it significantly outperforms the state of the art by over 10% in relative GPT-4 score and by 4% and 9% on open- and closed-set VQA. Our code, data, and model are publicly accessible at quilt-llava.github.io.
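The abstract's key data-collection idea is aligning a narrator's cursor with what is being said. The sketch below is not the released Quilt-Instruct pipeline; it is a minimal illustration assuming the cursor is the dominant moving object between consecutive frames, and that narration segments with start/end times already exist (e.g., from ASR). The names `find_cursor` and `ground_narrations` are illustrative only.

```python
# Hypothetical sketch: localize a narrator's cursor via frame differencing
# and pair it with timed narration segments. Not the paper's actual code.
import cv2
import numpy as np

def find_cursor(prev_frame, frame):
    """Return (x, y) of the largest inter-frame change, a cursor proxy."""
    diff = cv2.absdiff(
        cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY),
        cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY),
    )
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    if mask.sum() == 0:
        return None  # static frame: no cursor movement detected
    ys, xs = np.nonzero(mask)
    return int(xs.mean()), int(ys.mean())

def ground_narrations(video_path, narration_segments, frame_stride=5):
    """Attach cursor positions to narration segments spoken at the same time."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    grounded, prev, idx = [], None, 0
    ok, frame = cap.read()
    while ok:
        t = idx / fps
        if prev is not None and idx % frame_stride == 0:
            pos = find_cursor(prev, frame)
            if pos is not None:
                for seg in narration_segments:
                    if seg["start"] <= t <= seg["end"]:
                        grounded.append({"text": seg["text"], "time": t, "xy": pos})
        prev, idx = frame, idx + 1
        ok, frame = cap.read()
    cap.release()
    return grounded
```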
Related papers
- VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning [2.6954348706500766]
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology. It integrates three distinct image scenarios: single patch images, automatically extracted clips, and manually segmented video pathology images. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning.
arXiv Detail & Related papers (2025-05-07T07:41:19Z)
- From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis [81.19923502845441]
We develop a graph-based framework that constructs WSI graph representations.
We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches.
In our method's final step, we solve the diagnostic task through a graph attention network.
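A minimal sketch of that final step: a graph attention network over tissue-region nodes, here using PyTorch Geometric. Feature size, depth, and class count are placeholder assumptions; the authors' actual architecture may differ.

```python
# Illustrative GAT classifier over a WSI graph (nodes = tissue regions,
# edges = spatial adjacency). Not the paper's exact model.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class WSIGraphClassifier(torch.nn.Module):
    def __init__(self, in_dim=512, hidden=128, n_classes=2, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, in_dim] region features; edge_index: [2, num_edges]
        h = F.elu(self.gat1(x, edge_index))
        h = F.elu(self.gat2(h, edge_index))
        g = global_mean_pool(h, batch)  # one embedding per WSI graph
        return self.head(g)
```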
arXiv Detail & Related papers (2025-03-14T20:15:04Z)
- Semantic Segmentation Based Quality Control of Histopathology Whole Slide Images [2.953447779233234]
We developed a software pipeline for quality control (QC) of histopathology whole slide images (WSIs).
It segments various regions, such as blurs of different levels, tissue regions, tissue folds, and pen marks.
It was evaluated on all of TCGA, the largest publicly available WSI dataset, containing more than 11,000 histopathology images from 28 organs.
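A downstream use of such a segmentation-based QC pipeline is flagging unusable slides. The sketch below assumes a per-pixel label mask as output; the label ids and the 10% artifact threshold are assumptions, not values from the paper.

```python
# Illustrative QC flagging from a predicted per-pixel class mask.
import numpy as np

QC_LABELS = {0: "background", 1: "tissue", 2: "blur", 3: "fold", 4: "pen_mark"}

def qc_report(mask: np.ndarray, artifact_ids=(2, 3, 4), max_artifact_frac=0.10):
    """mask: 2-D array of integer class labels from the segmentation model."""
    tissue_px = np.isin(mask, [1, *artifact_ids]).sum()
    if tissue_px == 0:
        return {"usable": False, "reason": "no tissue detected"}
    fractions = {
        QC_LABELS[i]: float((mask == i).sum()) / tissue_px for i in artifact_ids
    }
    usable = sum(fractions.values()) <= max_artifact_frac
    return {"usable": usable, "artifact_fractions": fractions}
```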
arXiv Detail & Related papers (2024-10-04T10:03:04Z)
- PathAlign: A vision-language model for whole slide images in histopathology [13.567674461880905]
We develop a vision-language model based on the BLIP-2 framework using WSIs and curated text from pathology reports.
This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest.
We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization.
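Retrieval in a shared image-text embedding space reduces to nearest-neighbor search, as in this hedged sketch; the encoders are stand-ins, and PathAlign's actual BLIP-2 components are not reproduced here.

```python
# Text-to-WSI retrieval by cosine similarity in a shared embedding space.
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, wsi_embs: torch.Tensor, k: int = 5):
    """Return indices of the k slide embeddings closest to a text query.

    query_emb: [d] text embedding; wsi_embs: [n, d] slide embeddings.
    """
    q = F.normalize(query_emb, dim=-1)
    w = F.normalize(wsi_embs, dim=-1)
    sims = w @ q                      # cosine similarity per slide
    return torch.topk(sims, k).indices
```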
arXiv Detail & Related papers (2024-06-27T23:43:36Z)
- A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z)
- Learned representation-guided diffusion models for large-image generation [58.192263311786824]
We introduce a novel approach that trains diffusion models conditioned on embeddings from self-supervised learning (SSL).
Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images.
Augmenting real data by generating variations of real images improves downstream accuracy for patch-level and larger, image-scale classification tasks.
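The core idea, conditioning a denoiser on an SSL embedding, can be sketched in a few lines. Dimensions and the additive-conditioning choice below are assumptions for illustration, not the paper's architecture.

```python
# Toy denoiser conditioned on an SSL embedding and a diffusion timestep.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, x_dim=256, ssl_dim=768, hidden=512):
        super().__init__()
        self.cond_proj = nn.Linear(ssl_dim, hidden)   # SSL embedding -> hidden
        self.time_proj = nn.Linear(1, hidden)         # timestep -> hidden
        self.net = nn.Sequential(
            nn.Linear(x_dim + hidden, hidden), nn.SiLU(), nn.Linear(hidden, x_dim)
        )

    def forward(self, x_t, t, ssl_emb):
        # Fuse timestep and SSL condition, then predict the noise in x_t.
        cond = self.time_proj(t[:, None].float()) + self.cond_proj(ssl_emb)
        return self.net(torch.cat([x_t, cond], dim=-1))
```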
arXiv Detail & Related papers (2023-12-12T14:45:45Z)
- WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curated the largest WSI-text dataset (PathText).
On the model end, we propose the multiple instance generative model (MI-Gen).
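MI-Gen's details are not reproduced here, but a common building block for multiple instance models over WSIs is attention-based MIL pooling (after Ilse et al., 2018), which aggregates patch features into a slide embedding that a text decoder could then consume. This is generic background, not MI-Gen's actual architecture.

```python
# Attention-based MIL pooling: weight patches, sum to a slide embedding.
import torch
import torch.nn as nn

class AttentionMILPool(nn.Module):
    def __init__(self, feat_dim=512, attn_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, patch_feats):
        # patch_feats: [n_patches, feat_dim] for one slide.
        weights = torch.softmax(self.attn(patch_feats), dim=0)  # [n, 1]
        return (weights * patch_feats).sum(dim=0)  # slide-level embedding
```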
arXiv Detail & Related papers (2023-11-27T05:05:41Z)
- Quilt-1M: One Million Image-Text Pairs for Histopathology [10.263853626151297]
We use YouTube to curate a vision-language dataset consisting of 802,144 image-text pairs.
We combine QUILT with datasets from other sources, including Twitter, research papers, and the internet in general, to create QUILT-1M.
Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images.
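Linear probing, one of the two evaluations mentioned, fits a linear classifier on frozen image embeddings. The embedding extraction step is assumed to exist; only the probe itself is shown.

```python
# Linear probe over frozen encoder embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_embs, train_labels, test_embs, test_labels):
    """All inputs are numpy arrays; embeddings come from a frozen encoder."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embs, train_labels)
    return clf.score(test_embs, test_labels)  # top-1 accuracy
```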
arXiv Detail & Related papers (2023-06-20T00:14:47Z)
- Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of giga-pixel whole slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
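A rough sketch in the spirit of the summary above: a transformer aggregates patch features into a WSI embedding, which is fused with a gene-expression embedding to predict survival risk. All sizes are illustrative, and this is not the paper's exact hierarchical mapping.

```python
# Toy multimodal survival head: WSI transformer + gene MLP -> risk score.
import torch
import torch.nn as nn

class MultimodalSurvival(nn.Module):
    def __init__(self, patch_dim=512, gene_dim=1000, hidden=256):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.wsi_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.gene_mlp = nn.Sequential(nn.Linear(gene_dim, hidden), nn.ReLU())
        self.risk = nn.Linear(2 * hidden, 1)

    def forward(self, patch_feats, genes):
        # patch_feats: [batch, n_patches, patch_dim]; genes: [batch, gene_dim]
        h = self.wsi_encoder(self.patch_proj(patch_feats)).mean(dim=1)
        g = self.gene_mlp(genes)
        return self.risk(torch.cat([h, g], dim=-1))  # survival risk score
```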
arXiv Detail & Related papers (2022-11-29T23:47:56Z)
- Towards Automatic Parsing of Structured Visual Content through the Use of Synthetic Data [65.68384124394699]
We propose a synthetic dataset containing Structured Visual Content (SVC) in the form of images and ground truths.
We show the usage of this dataset by an application that automatically extracts a graph representation from an SVC image.
Our dataset enables the development of strong models for the interpretation of SVCs while skipping the time-consuming dense data annotation.
arXiv Detail & Related papers (2022-04-29T14:44:52Z)
- Code-free development and deployment of deep segmentation models for digital pathology [0.7812927717615301]
We present a code-free pipeline utilizing free-to-use, open-source software (QuPath, DeepMIB, and FastPathology) for creating and deploying deep learning-based segmentation models for computational pathology.
A dataset of 251 annotated WSIs, comprising 140 hematoxylin-eosin (HE)-stained and 111 CD3-immunostained colon biopsy WSIs, was developed through active learning using the pipeline.
We demonstrate pathologist-level segmentation accuracy and clinical runtime performance and show that pathologists without programming experience can create near state-of-the-art segmentation solutions.
arXiv Detail & Related papers (2021-11-16T13:08:05Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
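A generic formulation of a cross reconstruction loss between two views, hedged: each view's latent code is asked to reconstruct the other view. This is one standard instantiation, not necessarily the paper's exact loss.

```python
# Cross reconstruction: view-2's code explains view 1, and vice versa.
import torch
import torch.nn.functional as F

def cross_reconstruction_loss(z1, z2, x1, x2, dec1, dec2):
    """z1, z2: latent codes of views 1 and 2; dec1 reconstructs view 1, etc."""
    loss_12 = F.mse_loss(dec1(z2), x1)
    loss_21 = F.mse_loss(dec2(z1), x2)
    return loss_12 + loss_21
```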
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
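A bare-bones dual encoder in the style summarized above: a query encoder and a document encoder that runs an LSTM over sentence embeddings, scored by dot product. CDV's hierarchical layers and multi-task heads are not reproduced; all names and sizes are illustrative.

```python
# Toy dual encoder for passage ranking over long documents.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, sent_dim=384, hidden=256):
        super().__init__()
        self.query_proj = nn.Linear(sent_dim, hidden)
        self.doc_lstm = nn.LSTM(sent_dim, hidden, batch_first=True)

    def score(self, query_emb, doc_sent_embs):
        # query_emb: [batch, sent_dim]; doc_sent_embs: [batch, n_sents, sent_dim]
        q = self.query_proj(query_emb)               # [batch, hidden]
        out, _ = self.doc_lstm(doc_sent_embs)        # [batch, n_sents, hidden]
        return torch.einsum("bh,bsh->bs", q, out)    # per-sentence scores
```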
arXiv Detail & Related papers (2020-02-03T15:47:19Z)