MOSMOS: Multi-organ segmentation facilitated by medical report supervision
- URL: http://arxiv.org/abs/2409.02418v1
- Date: Wed, 4 Sep 2024 03:46:17 GMT
- Title: MOSMOS: Multi-organ segmentation facilitated by medical report supervision
- Authors: Weiwei Tian, Xinyu Huang, Junlin Hou, Caiyue Ren, Longquan Jiang, Rui-Wei Zhao, Gang Jin, Yuejie Zhang, Daoying Geng,
- Abstract summary: We propose a novel pre-training & fine-tuning framework for Multi-Organ Supervision (MOS)
Specifically, we first introduce global contrastive learning to align medical image-report pairs in the pre-training stage.
To remedy the discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags.
- Score: 10.396987980136602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to a large amount of multi-modal data in modern medical systems, such as medical images and reports, Medical Vision-Language Pre-training (Med-VLP) has demonstrated incredible achievements in coarse-grained downstream tasks (i.e., medical classification, retrieval, and visual question answering). However, the problem of transferring knowledge learned from Med-VLP to fine-grained multi-organ segmentation tasks has barely been investigated. Multi-organ segmentation is challenging mainly due to the lack of large-scale fully annotated datasets and the wide variation in the shape and size of the same organ between individuals with different diseases. In this paper, we propose a novel pre-training & fine-tuning framework for Multi-Organ Segmentation by harnessing Medical repOrt Supervision (MOSMOS). Specifically, we first introduce global contrastive learning to maximally align the medical image-report pairs in the pre-training stage. To remedy the granularity discrepancy, we further leverage multi-label recognition to implicitly learn the semantic correspondence between image pixels and organ tags. More importantly, our pre-trained models can be transferred to any segmentation model by introducing the pixel-tag attention maps. Different network settings, i.e., 2D U-Net and 3D UNETR, are utilized to validate the generalization. We have extensively evaluated our approach using different diseases and modalities on BTCV, AMOS, MMWHS, and BRATS datasets. Experimental results in various settings demonstrate the effectiveness of our framework. This framework can serve as the foundation to facilitate future research on automatic annotation tasks under the supervision of medical reports.
Related papers
- Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks [13.016940516468674]
We aim to enhance the capability of medical MLLMs in understanding anatomical regions within entire medical scans.
We propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system.
Our model can not only accomplish powerful performance across various medical vision-language tasks in bilingual settings, but also recognize and detect structures in multimodal medical scans.
arXiv Detail & Related papers (2024-10-24T02:55:41Z) - MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation [40.9095393430871]
We introduce MedViLaM, a unified vision-language model towards a generalist model for medical data.
MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and imaging.
We present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.
arXiv Detail & Related papers (2024-09-29T12:23:10Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challanging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - UniMOS: A Universal Framework For Multi-Organ Segmentation Over
Label-Constrained Datasets [6.428456997507811]
We present UniMOS, the first universal framework for achieving the utilization of fully and partially labeled images as well as unlabeled images.
We incorporate a semi-supervised training module that combines consistent regularization and pseudolabeling techniques on unlabeled data.
Experiments show that the framework exhibits excellent performance in several medical image segmentation tasks compared to other advanced methods.
arXiv Detail & Related papers (2023-11-17T00:44:56Z) - C^2M-DoT: Cross-modal consistent multi-view medical report generation
with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation with a domain transfer network (C2M-DoT)
C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation [38.61227663176952]
We propose a shift towards universal medical image segmentation, a paradigm aiming to build medical image understanding foundation models.
We develop Hermes, a novel context-prior learning approach to address the challenges of data heterogeneity and annotation differences in medical image segmentation.
arXiv Detail & Related papers (2023-06-04T17:39:08Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels [54.58539616385138]
We introduce a novel semi-supervised 2D medical image segmentation framework termed Mine yOur owN Anatomy (MONA)
First, prior work argues that every pixel equally matters to the model training; we observe empirically that this alone is unlikely to define meaningful anatomical features.
Second, we construct a set of objectives that encourage the model to be capable of decomposing medical images into a collection of anatomical features.
arXiv Detail & Related papers (2022-09-27T15:50:31Z) - Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities.
This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.
We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z) - Uncertainty-aware multi-view co-training for semi-supervised medical
image segmentation and domain adaptation [35.33425093398756]
Unlabeled data is much easier to acquire than well-annotated data.
We propose uncertainty-aware multi-view co-training for medical image segmentation.
Our framework is capable of efficiently utilizing unlabeled data for better performance.
arXiv Detail & Related papers (2020-06-28T22:04:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.