Customizing General-Purpose Foundation Models for Medical Report
Generation
- URL: http://arxiv.org/abs/2306.05642v1
- Date: Fri, 9 Jun 2023 03:02:36 GMT
- Title: Customizing General-Purpose Foundation Models for Medical Report
Generation
- Authors: Bang Yang, Asif Raza, Yuexian Zou, Tong Zhang
- Abstract summary: The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
- Score: 64.31265734687182
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Medical caption prediction, which can be regarded as a task of medical
report generation (MRG), requires the automatic generation of coherent and
accurate captions for the given medical images. However, the scarcity of labelled
medical image-report pairs presents great challenges for developing deep,
large-scale neural networks capable of harnessing the potential of artificial
general intelligence exhibited by large language models (LLMs). In this work, we
propose customizing off-the-shelf general-purpose large-scale pre-trained
models, i.e., foundation models (FMs), in computer vision and natural language
processing with a specific focus on medical report generation. Specifically,
following BLIP-2, a state-of-the-art vision-language pre-training approach, we
introduce our encoder-decoder-based MRG model. This model utilizes a
lightweight query Transformer to connect two FMs: the giant vision Transformer
EVA-ViT-g and a bilingual LLM trained to align with human intentions (referred
to as ChatGLM-6B). Furthermore, we conduct ablative experiments on the
trainable components of the model to identify the crucial factors for effective
transfer learning. Our findings demonstrate that unfreezing EVA-ViT-g to learn
medical image representations, followed by parameter-efficient training of
ChatGLM-6B to capture the writing styles of medical reports, is essential for
achieving optimal results. Our best submission (PCLmed Team) ranked 4th on the
BERTScore metric and 2nd on ROUGE-1 out of 13 participating teams in the
ImageCLEFmedical Caption 2023 Caption Prediction Task.
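The abstract describes a BLIP-2-style design in which a lightweight query Transformer bridges the vision FM (EVA-ViT-g) and the bilingual LLM (ChatGLM-6B). The sketch below illustrates this kind of bridge in plain PyTorch; the class name, layer counts, and default dimensions (1408-d EVA-ViT-g patch features, 4096-d ChatGLM-6B embeddings, 32 query tokens as in BLIP-2) are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of a BLIP-2-style query-Transformer bridge (assumed names/sizes).
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    """Learnable query tokens cross-attend to vision-encoder patch features
    and are projected into the LLM's embedding space as soft visual prompts."""

    def __init__(self, vision_dim=1408, llm_dim=4096, hidden_dim=768,
                 num_query_tokens=32, num_layers=6, num_heads=12):
        super().__init__()
        self.query_tokens = nn.Parameter(
            0.02 * torch.randn(1, num_query_tokens, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=4 * hidden_dim, batch_first=True, norm_first=True)
        # Decoder layers give the queries self-attention plus cross-attention
        # to the projected image patch features.
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, vision_dim) from the vision encoder
        memory = self.vision_proj(patch_feats)
        queries = self.query_tokens.expand(patch_feats.size(0), -1, -1)
        fused = self.blocks(tgt=queries, memory=memory)
        return self.llm_proj(fused)  # (batch, num_query_tokens, llm_dim)


if __name__ == "__main__":
    bridge = QFormerBridge()
    dummy_patches = torch.randn(2, 257, 1408)  # e.g. CLS token + 256 patches
    print(bridge(dummy_patches).shape)         # torch.Size([2, 32, 4096])
```

The projected query outputs would be prepended to the report-text embeddings fed to the LLM, which then generates the caption autoregressively.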
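The ablation finding (unfreeze EVA-ViT-g to learn medical image representations, and tune ChatGLM-6B parameter-efficiently to capture report writing style) can be expressed as a short training-setup sketch. It assumes the three components are available as PyTorch modules and uses a LoRA adapter from the Hugging Face peft library; the LoRA hyperparameters and the "query_key_value" target-module name for ChatGLM-6B are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the trainable-component setup suggested by the abstract's ablation
# findings; module handles and LoRA settings are assumptions, not released code.
from peft import LoraConfig, get_peft_model


def configure_trainable(vision_encoder, bridge, llm):
    # Unfreeze the vision FM so it can adapt to medical image statistics.
    for p in vision_encoder.parameters():
        p.requires_grad = True

    # The lightweight query Transformer is trained in full.
    for p in bridge.parameters():
        p.requires_grad = True

    # Keep the LLM frozen except for low-rank adapters, so it learns the
    # writing style of medical reports without full fine-tuning.
    lora_cfg = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["query_key_value"],  # fused QKV projection in ChatGLM-6B
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, lora_cfg)
    return vision_encoder, bridge, llm
```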
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation [0.8437187555622164]
This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks.
Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance.
The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index.
arXiv Detail & Related papers (2024-10-03T14:50:33Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Freeze the backbones: A Parameter-Efficient Contrastive Approach to Robust Medical Vision-Language Pre-training [15.790435273150083]
We introduce the backbone-agnostic Adaptor framework, which preserves medical knowledge in pre-trained image and text encoders by keeping them frozen.
Our framework delivers competitive performance while cutting trainable parameters by over 90% compared to current pre-training approaches.
arXiv Detail & Related papers (2024-01-02T12:14:41Z)
- Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models [26.146579369491718]
This study builds upon the state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2, to customize general large-scale foundation models.
Validation on the dataset of ImageCLEFmedical 2023 demonstrates our model's prowess, achieving the best-averaged results against several state-of-the-art methods.
arXiv Detail & Related papers (2023-12-07T01:01:45Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- MedSegDiff-V2: Diffusion based Medical Image Segmentation with Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z)
- Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG).
CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure.
Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.