Related papers: MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

URL: http://arxiv.org/abs/2408.02900v1
Date: Tue, 6 Aug 2024 02:09:35 GMT
Title: MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Authors: Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, Yuyin Zhou,
Abstract summary: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine. It covers over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. Unlike existing approach which is limited by the availability of image-text pairs, we have developed the first automated pipeline.
Score: 53.01393667775077
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, as well as detailed local annotations for regions of interest (ROIs), including bounding boxes, segmentation masks. Unlike existing approach which is limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and texual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular texual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretraining on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal large language models and other representative SoTA approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

Related papers

Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval [67.73095846666583]
Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition.<n>This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era.
arXiv Detail & Related papers (2026-02-23T15:27:41Z)
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities [68.12889379702824]
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. UniMed is a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs. We trained UniMed-CLIP, a unified VLM for six modalities, achieving notable gains in zero-shot evaluations.
arXiv Detail & Related papers (2024-12-13T18:59:40Z)
MRGen: Segmentation Data Engine for Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically important imaging modalities is challenging due to the scarcity of annotated data.<n>This paper investigates leveraging generative models to synthesize data, for training segmentation models for underrepresented modalities.<n>We present MRGen, a data engine for controllable medical image synthesis conditioned on text prompts and segmentation masks.
arXiv Detail & Related papers (2024-12-04T16:34:22Z)
Multimodal Medical Disease Classification with LLaMA II [0.14999444543328289]
We use the text-image pair dataset from OpenI consisting of 2D chest X-rays associated with clinical reports. Our focus is on fusion methods for merging text and vision information extracted from medical datasets. The newly introduced multimodal architecture can be applied to other multimodal datasets with little effort and can be easily adapted for further research.
arXiv Detail & Related papers (2024-12-02T09:18:07Z)
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training. LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation [40.9095393430871]
We introduce MedViLaM, a unified vision-language model towards a generalist model for medical data. MedViLaM can flexibly encode and interpret various forms of medical data, including clinical language and imaging. We present instances of zero-shot generalization to new medical concepts and tasks, effective transfer learning across different tasks, and the emergence of zero-shot medical reasoning.
arXiv Detail & Related papers (2024-09-29T12:23:10Z)
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challanging task of unsupervised pretraining. We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
A Refer-and-Ground Multimodal Large Language Model for Biomedicine [10.519866875035003]
The Med-GRIT-270k dataset is the first dedicated to the biomedical domain and integrates refer and ground conversations. We introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning.
arXiv Detail & Related papers (2024-06-26T07:56:17Z)
TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset. It contains 39,153 text-rich images, captions, and 102,437 questions. We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
We propose a novel framework that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. Our model can achieve performance superior to or on par with state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-16T02:35:17Z)
C^2M-DoT: Cross-modal consistent multi-view medical report generation with domain transfer network [67.97926983664676]
We propose a cross-modal consistent multi-view medical report generation with a domain transfer network (C2M-DoT) C2M-DoT substantially outperforms state-of-the-art baselines in all metrics.
arXiv Detail & Related papers (2023-10-09T02:31:36Z)
Large Scale Multi-Lingual Multi-Modal Summarization Dataset [26.92121230628835]
We present the current largest multi-lingual multi-modal summarization dataset (M3LS) It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.