A Comparison of Pre-trained Vision-and-Language Models for Multimodal
Representation Learning across Medical Images and Reports
- URL: http://arxiv.org/abs/2009.01523v1
- Date: Thu, 3 Sep 2020 09:00:47 GMT
- Title: A Comparison of Pre-trained Vision-and-Language Models for Multimodal
Representation Learning across Medical Images and Reports
- Authors: Yikuan Li, Hanyin Wang and Yuan Luo
- Abstract summary: In this study, we adopt four pre-trained V+L models to learn multimodal representation from MIMIC-CXR radiographs and associated reports.
In comparison to the pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models demonstrates performance improvement in the thoracic findings classification task.
- Score: 5.074841553282345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Joint image-text embedding extracted from medical images and associated
contextual reports is the bedrock for most biomedical vision-and-language (V+L)
tasks, including medical visual question answering, clinical image-text
retrieval, and clinical report auto-generation. In this study, we adopt four
pre-trained V+L models: LXMERT, VisualBERT, UNITER, and PixelBERT to learn
multimodal representation from MIMIC-CXR radiographs and associated reports.
The extrinsic evaluation on OpenI dataset shows that in comparison to the
pioneering CNN-RNN model, the joint embedding learned by pre-trained V+L models
demonstrates performance improvement in the thoracic findings classification
task. We conduct an ablation study to analyze the contribution of certain model
components and validate the advantage of joint embedding over text-only
embedding. We also visualize attention maps to illustrate the attention
mechanism of V+L models.
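The pipeline the abstract describes — fusing image-region and report-token features into one joint embedding, then classifying thoracic findings — can be sketched as follows. This is a minimal illustration with random arrays standing in for a real V+L encoder's output; the dimensions, single-head attention, mean-pooling, and 14-label head are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for encoder outputs: 36 image-region features and
# 24 report-token features, both projected to a shared 256-d space.
d = 256
img_feats = rng.standard_normal((36, d))
txt_feats = rng.standard_normal((24, d))

# Single-stream fusion (VisualBERT/UNITER-style): concatenate both
# modalities and let one self-attention step mix them.
tokens = np.concatenate([img_feats, txt_feats], axis=0)   # (60, d)
attn = softmax(tokens @ tokens.T / np.sqrt(d))            # (60, 60)
fused = attn @ tokens                                     # (60, d)

# Mean-pool to a single joint embedding for the image-report pair.
joint_emb = fused.mean(axis=0)                            # (d,)

# Multi-label sigmoid head for thoracic findings; 14 labels is a
# CheXpert-style assumption, not necessarily the paper's label set.
n_labels = 14
W = rng.standard_normal((d, n_labels)) * 0.01
probs = 1.0 / (1.0 + np.exp(-(joint_emb @ W)))

print(joint_emb.shape, probs.shape)
```

In a real setup the fusion block would be the frozen or fine-tuned pre-trained V+L encoder, and the linear head would be trained on the downstream classification labels.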
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Visual Prompt Engineering for Medical Vision Language Models in Radiology [0.1636269503300992]
Vision-language models (VLMs) offer a promising solution, leveraging pre-training to improve zero-shot classification performance.
In this paper, we explore the potential of visual prompt engineering to enhance attention to critical regions.
arXiv Detail & Related papers (2024-08-28T13:53:27Z)
- Integrating Medical Imaging and Clinical Reports Using Multimodal Deep Learning for Advanced Disease Analysis [3.8758525789991896]
An innovative multi-modal deep learning model is proposed to deeply integrate heterogeneous information from medical images and clinical reports.
For medical images, convolutional neural networks were used to extract high-dimensional features and capture key visual information.
For clinical report text, a bidirectional long short-term memory (BiLSTM) network combined with an attention mechanism is used for deep semantic understanding.
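The fusion pattern this entry describes — a CNN image descriptor combined with an attention-weighted recurrent text encoding — can be sketched roughly as below. The feature sizes, the dot-product attention scoring, and concatenation-based fusion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for learned representations:
# a CNN image descriptor and per-token BiLSTM hidden states.
img_vec = rng.standard_normal(512)          # CNN feature vector
hidden = rng.standard_normal((40, 256))     # 40 tokens, BiLSTM states

# Attention over tokens: score each hidden state, normalize with
# softmax, and take the weighted sum as the report representation.
w_attn = rng.standard_normal(256)
weights = softmax(hidden @ w_attn)          # (40,) sums to 1
txt_vec = weights @ hidden                  # (256,)

# Fuse by concatenation, then a single prediction layer.
joint = np.concatenate([img_vec, txt_vec])  # (768,)
W_out = rng.standard_normal((768, 1)) * 0.01
score = 1.0 / (1.0 + np.exp(-(joint @ W_out)))  # disease probability

print(joint.shape, float(score))
```

The attention weights make the text branch interpretable: tokens with higher weights contribute more to the report vector, which is one reason this pattern is common in clinical NLP.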
arXiv Detail & Related papers (2024-05-23T02:22:10Z)
- Self-supervised vision-language alignment of deep learning representations for bone X-rays analysis [53.809054774037214]
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports.
It is the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations.
arXiv Detail & Related papers (2024-05-14T19:53:20Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Application Of Vision-Language Models For Assessing Osteoarthritis Disease Severity [0.43431539537721414]
Osteoarthritis (OA) poses a global health challenge, demanding precise diagnostic methods.
Existing deep learning models for OA assessment are unimodal single task systems.
This study investigates employing Vision Language Processing models to predict OA severity using X-ray images and corresponding reports.
arXiv Detail & Related papers (2024-01-12T02:43:58Z)
- IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training [15.04212780946932]
We propose a novel framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment.
The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report.
arXiv Detail & Related papers (2023-10-11T10:12:43Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.