Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding
- URL: http://arxiv.org/abs/2501.14548v1
- Date: Fri, 24 Jan 2025 14:50:48 GMT
- Title: Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding
- Authors: Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
- Abstract summary: We propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Fine-grained alignment, however, faces considerable false-negative challenges. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients.
- Score: 17.783231335173486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial intelligence (AI) shows great potential in assisting radiologists to improve the efficiency and accuracy of medical image interpretation and diagnosis. However, a versatile AI model requires large-scale data and comprehensive annotations, which are often impractical in medical settings. Recent studies leverage radiology reports as a naturally high-quality supervision for medical images, using contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation. Nonetheless, these approaches typically contrast entire images with reports, neglecting the local associations between imaging regions and report sentences, which may undermine model performance and interpretability. In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Specifically, we explicitly match anatomical regions of CT images with corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually. Fine-grained alignment, however, faces considerable false-negative challenges, mainly from the abundance of anatomy-level healthy samples and similarly diseased abnormalities. To tackle this issue, we propose identifying false negatives of both normal and abnormal samples and calibrating contrastive learning from patient-level to disease-aware pairing. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.
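To make the abstract's recipe concrete, below is a minimal sketch (not the authors' released code) of anatomy-level contrastive pre-training with disease-aware pairing. It assumes per-anatomy image embeddings and per-anatomy report-sentence embeddings have already been extracted for a batch of patients, and it uses a hypothetical integer `finding_ids` code to mark which samples share the same normal or abnormal status; the exact false-negative identification and pairing rules in fVLM may differ.

```python
# Illustrative sketch of disease-aware, anatomy-level contrastive pre-training.
# Tensor names and the finding-status encoding are assumptions, not from the paper.
import torch
import torch.nn.functional as F


def disease_aware_infonce(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          finding_ids: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss for ONE anatomy across a batch of patients.

    img_emb:     (B, D) image embeddings of this anatomy, one row per patient.
    txt_emb:     (B, D) embeddings of the report sentences describing this anatomy.
    finding_ids: (B,) integer finding-status code (e.g. 0 = normal, k > 0 = a
                 specific abnormality). Samples sharing a code are treated as
                 mutual positives instead of false negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarities

    # Positive mask: the patient-level diagonal PLUS every pair with the same
    # finding status, which suppresses the false negatives created by abundant
    # healthy samples and by different patients with the same abnormality.
    pos_mask = (finding_ids.unsqueeze(0) == finding_ids.unsqueeze(1)).float()

    log_prob_i2t = logits.log_softmax(dim=1)                   # image -> text
    log_prob_t2i = logits.log_softmax(dim=0)                   # text  -> image

    # Average the negative log-likelihood over all positives of each anchor.
    i2t = -(pos_mask * log_prob_i2t).sum(1) / pos_mask.sum(1)
    t2i = -(pos_mask * log_prob_t2i).sum(0) / pos_mask.sum(0)
    return 0.5 * (i2t.mean() + t2i.mean())


if __name__ == "__main__":
    B, D = 8, 512
    loss = disease_aware_infonce(torch.randn(B, D), torch.randn(B, D),
                                 torch.randint(0, 3, (B,)))
    print(float(loss))
```

In a full pre-training loop, a loss of this form would presumably be computed for each of the 15 anatomies in the batch and summed, with the vision and text encoders shared across anatomies.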
Related papers
- Causal Disentanglement for Robust Long-tail Medical Image Generation [80.15257897500578]
We propose a novel medical image generation framework, which generates independent pathological and structural features.
We leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images.
arXiv Detail & Related papers (2025-04-20T01:54:18Z) - iMedImage Technical Report [5.0953390013898705]
Chromosome karyotype analysis is crucial for diagnosing hereditary diseases, yet detecting structural abnormalities remains challenging.
We developed iMedImage, an end-to-end model for general medical image recognition, demonstrating strong performance across multiple imaging tasks.
arXiv Detail & Related papers (2025-03-27T03:25:28Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language (see the prompt-scoring sketch after this list).
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z) - Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation [47.250147322130545]
Image-to-text radiology report generation aims to automatically produce radiology reports that describe the findings in medical images.
Most existing methods focus solely on the image data, disregarding the other patient information accessible to radiologists.
We present a novel multi-modal deep neural network framework for generating chest X-ray reports by integrating structured patient data, such as vital signs and symptoms, alongside unstructured clinical notes.
arXiv Detail & Related papers (2023-11-18T14:37:53Z) - An Empirical Analysis for Zero-Shot Multi-Label Classification on COVID-19 CT Scans and Uncurated Reports [0.5527944417831603]
The COVID-19 pandemic resulted in vast repositories of unstructured data, including radiology reports, due to increased medical examinations.
Previous research on automated diagnosis of COVID-19 primarily focuses on X-ray images, despite their lower precision compared to computed tomography (CT) scans.
In this work, we leverage unstructured data from a hospital and harness the fine-grained details offered by CT scans to perform zero-shot multi-label classification based on contrastive visual language learning.
arXiv Detail & Related papers (2023-09-04T17:58:01Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Self-supervised Learning from 100 Million Medical Images [13.958840691105992]
We propose a method for self-supervised learning of rich image features based on contrastive learning and online feature clustering.
We leverage large training datasets of over 100,000,000 medical images of various modalities, including radiography, computed tomography (CT), magnetic resonance (MR) imaging and ultrasonography.
We highlight a number of advantages of this strategy on challenging image assessment problems in radiography, CT and MR.
arXiv Detail & Related papers (2022-01-04T18:27:04Z) - Potential Features of ICU Admission in X-ray Images of COVID-19 Patients [8.83608410540057]
This paper presents an original methodology for extracting semantic features that correlate to severity from a data set with patient ICU admission labels.
The methodology employs a neural network trained to recognise lung pathologies to extract the semantic features.
The method has been shown to be capable of selecting images for the learned features, which could convey some information about their common locations in the lung.
arXiv Detail & Related papers (2020-09-26T13:48:39Z) - Integrative Analysis for COVID-19 Patient Outcome Prediction [53.11258640541513]
We combine radiomics of lung opacities and non-imaging features from demographic data, vital signs, and laboratory findings to predict the need for intensive care unit admission.
Our methods may also be applied to other lung diseases including but not limited to community acquired pneumonia.
arXiv Detail & Related papers (2020-07-20T19:08:50Z)
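Several entries above (fVLM, CT-GLIP, and the COVID-19 CT study) evaluate zero-shot diagnosis by comparing an image embedding against text prompts. The sketch below shows the common two-prompt scoring recipe on pre-computed embeddings; the prompt wording and function interface are illustrative assumptions, not taken from any of the papers.

```python
# Illustrative CLIP-style zero-shot scoring of one finding from pre-computed
# embeddings; prompt phrasing and interface are assumptions, not a paper's API.
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_probability(img_emb: torch.Tensor,
                          neg_prompt_emb: torch.Tensor,
                          pos_prompt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Probability that a finding is present, given pre-computed embeddings.

    img_emb:        (D,) embedding of the anatomy (or whole scan) being scored.
    neg_prompt_emb: (D,) text embedding of e.g. "no evidence of <disease>".
    pos_prompt_emb: (D,) text embedding of e.g. "findings consistent with <disease>".
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.stack([neg_prompt_emb, pos_prompt_emb]), dim=-1)  # (2, D)
    logits = img @ txt.t() / temperature      # similarity to each prompt
    return logits.softmax(dim=-1)[1]          # P(finding present)


if __name__ == "__main__":
    D = 512
    p = zero_shot_probability(torch.randn(D), torch.randn(D), torch.randn(D))
    print(float(p))
```

In practice the two prompt embeddings would typically come from the pre-trained text encoder, and the resulting probabilities are thresholded or used directly to compute per-disease AUC, as in the zero-shot evaluations cited above.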