CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
- URL: http://arxiv.org/abs/2412.12077v1
- Date: Mon, 16 Dec 2024 18:46:58 GMT
- Title: CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
- Authors: Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang,
- Abstract summary: CPath-Omni is the first LMM designed to unify both patch- and WSI-level image analysis.
CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets.
CPath-CLIP, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model.
- Score: 17.781388341968967
- Abstract: The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on separately training patch-level and whole-slide image (WSI)-level models, limiting the integration of learned knowledge across patches and WSIs and resulting in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify patch- and WSI-level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which for the first time integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model; CPath-CLIP achieves SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation models in pathology.
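For illustration only, the sketch below shows one way a CPath-CLIP-style dual encoder could be wired together: two vision backbones whose features are concatenated and projected into a shared space, an LLM-style text tower, and a symmetric contrastive loss. All modules, names, and dimensions are placeholders (the "LLM" here is just an embedding with mean pooling), not the released CPath-CLIP implementation.

```python
# Hypothetical sketch of a CLIP model that fuses two vision backbones and uses an
# LLM-style text tower; every module below is a toy stand-in, not CPath-CLIP itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedVisionEncoder(nn.Module):
    """Concatenates features from two placeholder vision backbones and projects
    them into the shared embedding space."""
    def __init__(self, dim_a=768, dim_b=1024, embed_dim=512):
        super().__init__()
        # Stand-ins for e.g. a general CLIP ViT and a pathology-specific ViT.
        self.backbone_a = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim_a))
        self.backbone_b = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim_b))
        self.proj = nn.Linear(dim_a + dim_b, embed_dim)

    def forward(self, images):
        feats = torch.cat([self.backbone_a(images), self.backbone_b(images)], dim=-1)
        return F.normalize(self.proj(feats), dim=-1)

class LLMTextEncoder(nn.Module):
    """Placeholder for an LLM used as the text tower: token embedding + mean pooling."""
    def __init__(self, vocab_size=32000, hidden=1024, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # mean-pool token states
        return F.normalize(self.proj(pooled), dim=-1)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over the in-batch image/text similarity matrix.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy forward pass on random data.
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(0, 32000, (4, 16))
vision, text = FusedVisionEncoder(), LLMTextEncoder()
loss = clip_loss(vision(images), text(tokens))
print(loss.item())
```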
Related papers
- MLLM4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLMs [34.454047458272505]
We highlight the need for universal multimodal embeddings that can support multiple downstream tasks.
Previous approaches often involve fine-tuning CLIP-based models, which handle images and text separately.
We introduce the Pathology Multimodal Embedding Benchmark (PMEB), a benchmark designed to assess the quality of pathology multimodal embeddings.
arXiv Detail & Related papers (2025-02-11T03:28:55Z) - Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement [18.839406725114042]
We present Concept Anchor-guided Task-specific Feature Enhancement (CATE)
CATE can boost the expressivity and discriminativeness of pathology foundation models for specific downstream tasks.
Experiments on public WSI datasets demonstrate that CATE significantly enhances the performance and generalizability of MIL models.
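As a rough illustration of concept-guided feature enhancement (not the authors' exact CATE implementation), the sketch below reweights patch features by their similarity to task-specific concept anchors before attention-based MIL pooling. The anchors here are learnable placeholders; in practice they might come from a pathology VLM text encoder.

```python
# Hedged sketch: concept-anchor-weighted patch features feeding an attention-MIL head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptGuidedMIL(nn.Module):
    def __init__(self, feat_dim=512, n_concepts=8, hidden=256, n_classes=2):
        super().__init__()
        # Placeholder concept anchors; a real system could derive them from text prompts.
        self.anchors = nn.Parameter(torch.randn(n_concepts, feat_dim))
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):  # (n_patches, feat_dim)
        sims = F.normalize(patch_feats, dim=-1) @ F.normalize(self.anchors, dim=-1).t()
        relevance = sims.max(dim=-1).values            # concept relevance per patch
        enhanced = patch_feats * (1 + relevance.unsqueeze(-1))  # calibrate features
        weights = torch.softmax(self.attn(enhanced), dim=0)     # attention MIL pooling
        slide_emb = (weights * enhanced).sum(dim=0)
        return self.classifier(slide_emb)

logits = ConceptGuidedMIL()(torch.randn(1000, 512))  # one bag of 1000 patch features
```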
arXiv Detail & Related papers (2024-11-15T02:38:00Z) - MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network [2.6954348706500766]
Whole slide image (WSI) classification is a crucial problem for cancer diagnostics in clinics and hospitals.
Previous MIL-based models designed for this problem have only been evaluated on individual tasks for specific organs.
We propose MECFormer, a generative Transformer-based model designed to handle multiple tasks within one model.
arXiv Detail & Related papers (2024-10-06T14:56:23Z) - PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration [14.979275480422213]
Vision Language Models (VLMs) like CLIP have attracted substantial attention in pathology.
Current efforts to train pathology VLMs rely on pathology image-text pairs from platforms like PubMed, YouTube, and Twitter.
We leverage large-scale WSI datasets like TCGA to extract numerous high-quality image patches.
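A minimal sketch of the patch-extraction step, assuming openslide-python is installed and "slide.svs" is a hypothetical local WSI file; the tissue filter is a crude brightness threshold, not the paper's quality-control or multi-agent captioning pipeline.

```python
# Illustrative WSI patch extraction with openslide; real pipelines would subsample,
# parallelize, and apply far stronger tissue/quality filtering.
import numpy as np
import openslide

slide = openslide.OpenSlide("slide.svs")       # hypothetical local file
patch_size, level = 512, 0
W, H = slide.level_dimensions[level]

patches = []
for y in range(0, H - patch_size, patch_size):
    for x in range(0, W - patch_size, patch_size):
        region = slide.read_region((x, y), level, (patch_size, patch_size)).convert("RGB")
        arr = np.asarray(region)
        # Crude tissue check: keep patches that are not mostly white background.
        if arr.mean() < 220:
            patches.append(((x, y), region))

print(f"kept {len(patches)} candidate tissue patches")
```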
arXiv Detail & Related papers (2024-06-28T19:18:09Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, because pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
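The toy sketch below illustrates the general shape of multi-task supervised pretraining: a shared backbone with per-task heads and a summed loss. The tasks, heads, and shapes are placeholders, not the actual MTP/SAMRS configuration.

```python
# Generic multi-task supervised pretraining skeleton (placeholder tasks and heads).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, embed_dim=256, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, embed_dim))
        self.heads = nn.ModuleDict({
            "scene_cls": nn.Linear(embed_dim, n_classes),
            "seg_proxy": nn.Linear(embed_dim, n_classes),  # stands in for a dense head
        })

    def forward(self, x):
        feats = self.backbone(x)                 # features shared across all tasks
        return {name: head(feats) for name, head in self.heads.items()}

model = MultiTaskModel()
x = torch.randn(8, 3, 64, 64)
labels = {"scene_cls": torch.randint(0, 10, (8,)),
          "seg_proxy": torch.randint(0, 10, (8,))}
outputs = model(x)
loss = sum(F.cross_entropy(outputs[k], labels[k]) for k in outputs)  # joint objective
loss.backward()
```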
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
The latent diffusion model (LDM) is an effective minimalist approach for in-context segmentation.
We build a new and fair in-context segmentation benchmark that includes both image and video datasets.
arXiv Detail & Related papers (2024-03-14T17:52:31Z) - PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains [5.422494000842841]
We present PathoDuet, a series of pretrained models on histopathology images, and a new self-supervised learning framework in histochemistry.
The framework features a newly introduced pretext token and later task raisers to explicitly utilize certain relations between images.
Two pretext tasks, cross-scale positioning and cross-stain transferring, are designed to pretrain the model on Hematoxylin and Eosin images.
arXiv Detail & Related papers (2023-12-15T15:45:52Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - AIMS: All-Inclusive Multi-Level Segmentation [93.5041381700744]
We propose a new task, All-Inclusive Multi-Level Segmentation (AIMS), which segments visual regions into three levels: part, entity, and relation.
We also build a unified AIMS model through multi-dataset multi-task training to address the two major challenges of annotation inconsistency and task correlation.
arXiv Detail & Related papers (2023-05-28T16:28:49Z) - Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
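A minimal sketch of the "freeze the original weights, train only new modularized blocks" pattern, using a small residual adapter as the new block; module names are illustrative and not taken from the REACT codebase.

```python
# Frozen backbone + trainable adapter block: only the new modules receive gradients.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

class CustomizedModel(nn.Module):
    def __init__(self, backbone, dim=512, n_classes=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                    # original weights stay frozen
        self.adapter = Adapter(dim)                    # only the new blocks train
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(self.adapter(feats))

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in encoder
model = CustomizedModel(backbone)
logits = model(torch.randn(4, 3, 32, 32))
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)      # optimize only the new blocks
```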
arXiv Detail & Related papers (2023-01-17T18:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.