Quilt-1M: One Million Image-Text Pairs for Histopathology
- URL: http://arxiv.org/abs/2306.11207v3
- Date: Fri, 27 Oct 2023 23:56:33 GMT
- Title: Quilt-1M: One Million Image-Text Pairs for Histopathology
- Authors: Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo,
Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay
Krishna, Linda Shapiro
- Abstract summary: We use YouTube to curate a vision-language dataset consisting of $802,144$ image and text pairs.
We combine QUILT with datasets from other sources, including Twitter, research papers, and the internet in general, to create QUILT-1M.
Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new histopathology images.
- Score: 10.263853626151297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent accelerations in multi-modal applications have been made possible with
the plethora of image and text data available online. However, the scarcity of
analogous data in the medical field, specifically in histopathology, has slowed
comparable progress. To enable similar representation learning for
histopathology, we turn to YouTube, an untapped resource of videos, offering
$1,087$ hours of valuable educational histopathology videos from expert
clinicians. From YouTube, we curate QUILT: a large-scale vision-language
dataset consisting of $802,144$ image and text pairs. QUILT was automatically
curated using a mixture of models, including large language models, handcrafted
algorithms, human knowledge databases, and automatic speech recognition. In
comparison, the most comprehensive datasets curated for histopathology amass
only around $200$K samples. We combine QUILT with datasets from other sources,
including Twitter, research papers, and the internet in general, to create an
even larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it
as the largest vision-language histopathology dataset to date. We demonstrate
the value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model
outperforms state-of-the-art models on both zero-shot and linear probing tasks
for classifying new histopathology images across $13$ diverse patch-level
datasets of $8$ different sub-pathologies and cross-modal retrieval tasks.
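As a concrete illustration of the zero-shot setup described above, the minimal sketch below scores a histopathology patch against text prompts with a CLIP-style model via the open_clip library. The hub id `hf-hub:wisdomik/QuiltNet-B-32`, the class names, and the prompt template are illustrative assumptions rather than the paper's exact evaluation configuration.

```python
# Minimal zero-shot classification sketch (assumed checkpoint id and class prompts).
import torch
import open_clip
from PIL import Image

MODEL_ID = "hf-hub:wisdomik/QuiltNet-B-32"  # assumption: a QUILT-1M-trained CLIP checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Hypothetical patch-level classes, phrased as natural-language prompts.
class_names = ["adipose tissue", "lymphocytes", "tumor epithelium", "stroma"]
texts = tokenizer([f"a histopathology image of {c}" for c in class_names])
image = preprocess(Image.open("patch.png")).unsqueeze(0)  # any RGB patch

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(class_names[int(probs.argmax())])
```

Linear probing reuses the same frozen `encode_image` features: embeddings are cached for the training split and a lightweight classifier (e.g. logistic regression) is fit on top, so only the probe's parameters are learned.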
Related papers
- BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset.
Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles.
BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
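To make "pretrained via streaming" concrete, here is a minimal sketch using the Hugging Face `datasets` streaming mode, where samples are pulled on the fly instead of downloading the full corpus; the dataset id and field names are assumptions for illustration, not the actual BIOMEDICA loader.

```python
# Sketch: iterate over image-caption pairs as a stream (no full local download).
# The dataset id and the "image"/"caption" field names are assumptions.
from datasets import load_dataset

stream = load_dataset("BIOMEDICA/biomedica", split="train", streaming=True)

for example in stream.take(4):  # IterableDataset.take yields the first n samples
    image, caption = example["image"], example["caption"]
    # ...feed (image, caption) into the contrastive pretraining step...
    print(caption[:80])
```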
arXiv Detail & Related papers (2025-01-13T09:58:03Z)
- Multimodal Medical Disease Classification with LLaMA II [0.14999444543328289]
We use the text-image pair dataset from OpenI consisting of 2D chest X-rays associated with clinical reports.
Our focus is on fusion methods for merging text and vision information extracted from medical datasets.
The newly introduced multimodal architecture can be applied to other multimodal datasets with little effort and can be easily adapted for further research.
arXiv Detail & Related papers (2024-12-02T09:18:07Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration [14.979275480422213]
Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology.
Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter.
We leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches.
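As a hedged sketch of this kind of patch extraction, the snippet below tiles a whole-slide image with the OpenSlide library; the file name, patch size, and the crude brightness-based background filter are illustrative assumptions, not the PathGen-1.6M pipeline itself.

```python
# Sketch: tile a whole-slide image (WSI) into fixed-size patches with OpenSlide.
# The slide path, patch size, and background filter are assumptions.
import numpy as np
import openslide

slide = openslide.OpenSlide("TCGA-example.svs")   # hypothetical slide file
patch_size = 256
width, height = slide.dimensions                  # level-0 size in pixels

patches = []
for y in range(0, height - patch_size + 1, patch_size):
    for x in range(0, width - patch_size + 1, patch_size):
        patch = slide.read_region((x, y), 0, (patch_size, patch_size)).convert("RGB")
        if np.asarray(patch).mean() < 220:        # skip mostly-white background tiles
            patches.append(((x, y), patch))
```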
arXiv Detail & Related papers (2024-06-28T19:18:09Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing [72.45257414889478]
We aim to reduce human workload by predicting connectivity between over-segmented neuron pieces.
We first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments spanning the whole fly brain.
We propose a novel connectivity-aware contrastive learning method to generate dense volumetric EM image embedding.
arXiv Detail & Related papers (2024-01-05T19:45:12Z)
- WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curated the largest WSI-text dataset (PathText).
On the model end, we propose the multiple instance generative model (MI-Gen).
arXiv Detail & Related papers (2023-11-27T05:05:41Z)
- Towards a Visual-Language Foundation Model for Computational Pathology [5.72536252929528]
We introduce CONtrastive learning from Captions for Histopathology (CONCH).
CONCH is a visual-language foundation model developed using diverse sources of histopathology images, biomedical text, and task-agnostic pretraining.
It is evaluated on a suite of 13 diverse benchmarks, achieving state-of-the-art performance on histology image classification, segmentation, captioning, text-to-image and image-to-text retrieval.
arXiv Detail & Related papers (2023-07-24T16:13:43Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, achieving the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- Learning from few examples: Classifying sex from retinal images via deep learning [3.9146761527401424]
We showcase the performance of deep learning (DL) on small datasets by classifying patient sex from fundus images.
Our models, developed using approximately 2500 fundus images, achieved test AUC scores of up to 0.72.
This corresponds to a mere 25% decrease in performance despite a nearly 1000-fold decrease in the dataset size.
arXiv Detail & Related papers (2022-07-20T02:47:29Z)
- HistoKT: Cross Knowledge Transfer in Computational Pathology [31.14107299224401]
The lack of well-annotated datasets in computational pathology (CPath) obstructs the application of deep learning techniques for classifying medical images.
Most transfer learning research follows a model-centric approach, tuning network parameters to improve transfer results over few datasets.
arXiv Detail & Related papers (2022-01-27T00:34:19Z)