Related papers: Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

URL: http://arxiv.org/abs/2403.17834v2
Date: Wed, 16 Oct 2024 12:49:19 GMT
Title: Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography
Authors: Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Omer Faruk Durugol, Bastian Wittmann, Tamaz Amiranashvili, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Christian Bluethgen, Mehmet Kemal Ozdemir, Bjoern Menze
Abstract summary: We introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. We develop CT-CLIP, a CT-focused contrastive language-image pretraining framework. We create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes.
Score: 1.8424705673580284
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovations in medical AI and improved patient care.
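
Illustrative sketch (not from the paper): the abstract describes CT-CLIP as a contrastive language-image pretraining framework but does not spell out its objective. The snippet below shows the generic symmetric CLIP-style contrastive loss over paired volume and report embeddings, as one plausible reading. The function names, the temperature value of 0.07, and the toy 2-D embeddings are all assumptions made for illustration; they are not taken from the CT-CLIP code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_embs[k] and text_embs[k] are assumed to come from the same
    scan/report pair; every other combination is treated as a negative.
    The temperature is an assumed hyperparameter.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each image should pick out its own report.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: transpose the logits and repeat.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two "volumes" and two "reports" as 2-D embeddings.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, the correctly paired embeddings give a loss near zero, while the swapped pairing gives a much larger loss. That gap is the signal a contrastive objective of this kind optimizes.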

Related papers

CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling [12.457017701871273]
We present the first publicly available eye gaze dataset on CT, called CT-ScanGaze. We then introduce CT-Searcher, a novel 3D scanpath predictor designed specifically to process CT volumes and generate radiologist-like 3D fixation sequences.
arXiv Detail & Related papers (2025-07-16T19:21:05Z)
Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation [18.113659670915474]
We propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to obtain important visual information. Experiment results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models.
arXiv Detail & Related papers (2025-06-24T14:29:06Z)
Towards Scalable Language-Image Pre-training for 3D Medical Imaging [49.18894445671976]
We introduce Hierarchical attention for Language-Image Pre-training (HLIP), a scalable pre-training framework for 3D medical imaging. HLIP adopts a lightweight hierarchical attention mechanism inspired by the natural hierarchy of radiology data: slice, scan, and study. Trained on 220K patients with 3.13 million scans for brain MRI and 240K patients with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-05-28T01:16:34Z)
CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering [23.158482226185217]
A visual question answering (VQA) system that can answer radiologists' questions about anatomical regions on a CT scan is urgently needed. Existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomic complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. This paper proposes CT-Agent, a multimodal agentic framework for CTQA.
arXiv Detail & Related papers (2025-05-22T04:59:20Z)
A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT [67.34586036959793]
There is no fully annotated CT dataset with all anatomies delineated for training. We propose a novel continual learning-driven CT model that can segment complete anatomies. Our single unified CT segmentation model, CL-Net, can highly accurately segment a clinically comprehensive set of 235 fine-grained whole-body anatomies.
arXiv Detail & Related papers (2025-03-16T23:55:02Z)
3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans. Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models [17.75505740079875]
We explore the feasibility of leveraging language as a naturally high-quality supervision for chest CT imaging. We bootstrap the understanding of 3D chest CT images by distilling chest-related diagnostic knowledge from an extensively pre-trained 2D X-ray expert model. We train our model with over 12,000 pairs of chest CT images and radiology reports.
arXiv Detail & Related papers (2024-04-07T12:17:40Z)
CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging [0.20754235913398283]
We introduce the first method to generate radiology reports for 3D medical imaging, specifically targeting chest CT. Given the absence of comparable methods, we establish a baseline using an advanced 3D vision encoder in medical imaging to demonstrate our method's effectiveness. We augment CT2Rep with a cross-attention-based multi-modal fusion module and hierarchical memory, enabling the incorporation of longitudinal multimodal data.
arXiv Detail & Related papers (2024-03-11T15:17:45Z)
Multi-View Vertebra Localization and Identification from CT Images [57.56509107412658]
We propose a multi-view vertebra localization and identification method for CT images. We convert the 3D problem into a 2D localization and identification task on different views. Our method can learn the multi-view global information naturally.
arXiv Detail & Related papers (2023-07-24T14:43:07Z)
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets. We have collected approximately 1.3 million medical images from 55 publicly available datasets. LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
Self-supervised 3D anatomy segmentation using self-distilled masked image transformer (SMIT) [2.7298989068857487]
Self-supervised learning has demonstrated success in medical image segmentation using convolutional networks. We show our approach is more accurate and requires fewer fine tuning datasets than other pretext tasks.
arXiv Detail & Related papers (2022-05-20T17:55:14Z)
Fed-Sim: Federated Simulation for Medical Imaging [131.56325440976207]
We introduce a physics-driven generative approach that consists of two learnable neural modules. We show that our data synthesis framework improves the downstream segmentation performance on several datasets.
arXiv Detail & Related papers (2020-09-01T19:17:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.