Multi-modal Pre-training for Medical Vision-language Understanding and
Generation: An Empirical Study with A New Benchmark
- URL: http://arxiv.org/abs/2306.06494v2
- Date: Thu, 24 Aug 2023 07:52:59 GMT
- Title: Multi-modal Pre-training for Medical Vision-language Understanding and
Generation: An Empirical Study with A New Benchmark
- Authors: Li Xu, Bo Liu, Ameer Hamza Khan, Lu Fan, Xiao-Ming Wu
- Abstract summary: We propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs.
RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval.
- Score: 12.565598914787834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the availability of large-scale, comprehensive, and general-purpose
vision-language (VL) datasets such as MSCOCO, vision-language pre-training
(VLP) has become an active area of research and proven to be effective for
various VL tasks such as visual question answering. However, studies on VLP in
the medical domain have so far been scarce. To provide a comprehensive
perspective on VLP for medical VL tasks, we conduct a thorough experimental
analysis to study key factors that may affect the performance of VLP with a
unified vision-language Transformer. To enable sound and quick pre-training
decisions, we propose RadioGraphy Captions (RGC), a high-quality, multi-modality
radiographic dataset containing 18,434 image-caption pairs collected from
MedPix, an open-access online database. RGC can be used as a
pre-training dataset or a new benchmark for medical report generation and
medical image-text retrieval. By utilizing RGC and other available datasets for
pre-training, we derive several key insights that can guide future medical VLP
research and establish new strong baselines for various medical VL tasks.
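To make the pre-training setting concrete, the sketch below shows one common VLP objective, image-text contrastive alignment, applied to RGC-style radiograph-caption pairs. It is a minimal illustration under assumed encoder interfaces; the function name, encoder objects, and hyperparameters are not taken from the paper, and it does not reproduce the unified vision-language Transformer or its full set of pre-training objectives.

```python
# A minimal, hypothetical sketch of image-text contrastive pre-training on
# RGC-style (radiograph, caption) pairs. The encoders are assumed to map a
# batch of images / tokenized captions to [B, D] embeddings; names and
# hyperparameters are illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, images, captions, temperature=0.07):
    """One symmetric InfoNCE step over a batch of paired radiographs and captions."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # [B, D]
    txt_emb = F.normalize(text_encoder(captions), dim=-1)  # [B, D]
    logits = img_emb @ txt_emb.t() / temperature           # [B, B] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Match each image to its own caption and each caption to its own image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In practice, such a contrastive term is typically combined with other pre-training objectives (e.g., image-text matching and masked language modeling) in a unified vision-language Transformer.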
Related papers
- A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review [0.0]
Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze medical data.
Our paper reviews recent advancements in developing models designed for medical report generation and visual question answering.
arXiv Detail & Related papers (2024-03-04T20:29:51Z) - MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z) - Freeze the backbones: A Parameter-Efficient Contrastive Approach to
Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
arXiv Detail & Related papers (2024-01-02T12:14:41Z) - Medical Vision Language Pretraining: A survey [8.393439175704124]
Medical Vision Language Pretraining is a promising solution to the scarcity of labeled data in the medical domain.
By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations.
arXiv Detail & Related papers (2023-12-11T09:14:13Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Align, Reason and Learn: Enhancing Medical Vision-and-Language
Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach that enhances medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model, enabling it to reason with knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
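As a concrete illustration of the position modeling described in the PEVL entry above, here is a minimal, hypothetical sketch of how bounding-box coordinates can be discretized into position tokens and mixed with ordinary text tokens; the bin count, token format, and function name are illustrative assumptions rather than PEVL's exact design.

```python
# Hypothetical illustration of position discretization for unified language
# modeling: bounding-box coordinates are mapped to a small vocabulary of
# position tokens so that object positions can be predicted with ordinary
# language-modeling objectives. Bin count and token format are assumptions.

def box_to_position_tokens(box, image_w, image_h, num_bins=512):
    """Convert an (x1, y1, x2, y2) pixel box into discrete position tokens."""
    x1, y1, x2, y2 = box
    normalized = (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
    bins = [min(int(v * num_bins), num_bins - 1) for v in normalized]
    return [f"<pos_{b}>" for b in bins]

# Example: append position tokens to a region description so a language model
# can jointly model the text and the region it refers to.
tokens = ["a", "nodule", "in", "the", "left", "lung"] + box_to_position_tokens(
    (120, 80, 200, 160), image_w=512, image_h=512
)
print(tokens)  # [..., '<pos_120>', '<pos_80>', '<pos_200>', '<pos_160>']
```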
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.