Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
- URL: http://arxiv.org/abs/2105.11333v1
- Date: Mon, 24 May 2021 15:14:09 GMT
- Title: Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
- Authors: Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Edward Choi
- Abstract summary: We propose Medical Vision Language Learner (MedViLL) which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme.
We empirically demonstrate the superior downstream task performance of MedViLL against various baselines including task-specific architectures.
- Score: 5.119201893752376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (image-report retrieval, disease classification, medical visual question answering) and a vision-language generation task (report generation). By rigorously evaluating the proposed model on four downstream tasks with two chest X-ray image datasets (MIMIC-CXR and Open-I), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
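The abstract names the multimodal attention masking scheme but does not spell it out here. A common way to make one Transformer serve both understanding and generation is to switch the self-attention mask over the joint [image tokens; text tokens] sequence: fully bidirectional for understanding tasks, causal over the report tokens for generation. The PyTorch sketch below illustrates that idea; the function name and the seq2seq-style convention (image tokens not attending to text in generation mode) are illustrative assumptions, not the authors' code.

```python
import torch

def build_multimodal_mask(num_img: int, num_txt: int, causal_text: bool) -> torch.Tensor:
    """Attention mask over a joint [image tokens; text tokens] sequence.

    Convention here: mask[i, j] == True means position i MAY attend to j.
    causal_text=False -> fully bidirectional (understanding tasks).
    causal_text=True  -> text attends causally to text and freely to the image,
    while image tokens attend only within the image (generation).
    """
    n = num_img + num_txt
    mask = torch.ones(n, n, dtype=torch.bool)  # start fully bidirectional
    if causal_text:
        causal = torch.tril(torch.ones(num_txt, num_txt, dtype=torch.bool))
        mask[num_img:, num_img:] = causal      # text -> earlier text only
        mask[:num_img, num_img:] = False       # image tokens do not see text
    return mask

# Note: torch.nn.MultiheadAttention uses the opposite convention for boolean
# masks (True = blocked), so pass attn_mask=~build_multimodal_mask(...) there.
```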
Related papers
- VoxelPrompt: A Vision-Language Agent for Grounded Medical Image Analysis [9.937830036053871]
VoxelPrompt tackles diverse radiological tasks through joint modeling of natural language, image volumes, and analytical metrics.
We show that VoxelPrompt can delineate hundreds of anatomical and pathological features, measure many complex morphological properties, and perform open-language analysis of lesion characteristics.
arXiv Detail & Related papers (2024-10-10T22:11:43Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
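The summary names the three feature families but not the coupling mechanism. One plausible, purely illustrative reading is pairwise contrastive alignment among per-case visual, knowledge, and linguistic embeddings, sketched below with a standard InfoNCE form; nothing here comes from the ViKL code.

```python
import torch
import torch.nn.functional as F

def triple_alignment_loss(vis, kno, lin, temperature: float = 0.07):
    """Pairwise InfoNCE over (visual, knowledge, linguistic) embeddings of the
    same cases, each of shape (B, D); an illustrative guess at how three
    modalities could be aligned, not ViKL's actual objective."""
    def nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature           # in-batch similarity matrix
        labels = torch.arange(a.size(0), device=a.device)
        return (F.cross_entropy(logits, labels)
                + F.cross_entropy(logits.t(), labels)) / 2
    return nce(vis, kno) + nce(vis, lin) + nce(kno, lin)
```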
- Medical Vision Generalist: Unifying Medical Imaging Tasks in Context [30.300087629262666]
This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks.
MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images.
Our results consistently establish MVG's superior performance, outperforming existing vision generalists, such as Painter and LVM.
arXiv Detail & Related papers (2024-06-08T20:07:39Z)
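MVG is compared against Painter, which suggests a Painter-style in-context interface: example pairs and the query are stitched into one canvas and the model fills in the output region. The sketch below shows such an interface under that assumption; the 2x2 grid layout is illustrative, not MVG's documented format.

```python
import torch

def make_incontext_canvas(example_in, example_out, query_in):
    """Stitch an in-context prompt where inputs and outputs are both images.

    All arguments are (C, H, W) tensors of identical shape. The model is asked
    to inpaint the blank quadrant with the query's output image. The grid
    layout is an illustrative assumption.
    """
    blank = torch.zeros_like(query_in)                  # region to be generated
    top = torch.cat([example_in, example_out], dim=2)   # concat along width
    bottom = torch.cat([query_in, blank], dim=2)
    return torch.cat([top, bottom], dim=1)              # (C, 2H, 2W) canvas
```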
- Intensive Vision-guided Network for Radiology Report Generation [22.030289124516326]
We propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception.
We also explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction.
arXiv Detail & Related papers (2024-02-06T06:46:46Z)
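The second sentence describes a concrete decoding step: fuse the words predicted so far with region-aware visual features when predicting the next word. A minimal PyTorch sketch of such a step follows; the module and its layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisuallyConditionedDecoderStep(nn.Module):
    """Next-word prediction conditioned on region features and the words
    generated so far (an illustrative reading of the idea, not the authors'
    network)."""
    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, region_feats):
        # prev_tokens: (B, T) ids of previously predicted words
        # region_feats: (B, R, d_model) region-aware visual features
        x = self.embed(prev_tokens)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x, _ = self.self_attn(x, x, x, attn_mask=causal)        # mix prior words
        x, _ = self.cross_attn(x, region_feats, region_feats)   # ground in regions
        return self.out(x[:, -1])                               # next-word logits
```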
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
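A hedged sketch of what a joint image-text reconstruction objective with alignment could look like: regress masked image patches, classify masked report tokens, and tie the two encoders together with a global InfoNCE term (the same form as in the ViKL sketch above). Tensor shapes and the equal loss weighting are assumptions, and the memory-augmented fusion module is omitted.

```python
import torch
import torch.nn.functional as F

def paired_masking_loss(img_pred, img_target, img_mask,
                        txt_logits, txt_target, txt_mask,
                        img_emb, txt_emb, temperature: float = 0.07):
    """Illustrative MPMA-style objective, not the paper's exact loss.

    img_pred/img_target: (B, N, P) patch pixels, img_mask: (B, N) bool
    txt_logits: (B, T, V), txt_target: (B, T) ids, txt_mask: (B, T) bool
    img_emb/txt_emb: (B, D) pooled global embeddings
    """
    mim = F.mse_loss(img_pred[img_mask], img_target[img_mask])          # masked patches
    mlm = F.cross_entropy(txt_logits[txt_mask], txt_target[txt_mask])   # masked tokens
    a = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    align = (F.cross_entropy(a, labels) + F.cross_entropy(a.t(), labels)) / 2
    return mim + mlm + align                                            # equal weights assumed
```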
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
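The summary does describe the encoder shape: a CNN backbone per image and a Transformer that fuses the current and prior studies. The sketch below follows that description with illustrative sizes; the backbone choice (ResNet-18), the learned time embeddings, and the handling of a missing prior are assumptions, not BioViL-T's configuration.

```python
import torch
import torch.nn as nn
import torchvision

class TemporalImageEncoder(nn.Module):
    """CNN-Transformer hybrid over (current, prior) image pairs, loosely in
    the spirit of BioViL-T; sizes and fusion details are assumptions."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial map
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        self.time_emb = nn.Parameter(torch.zeros(2, 1, d_model))   # current vs prior
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, n_layers)

    def forward(self, current, prior=None):
        feats = []
        for t, img in enumerate([current, prior]):
            if img is None:
                continue                        # the prior study may be missing
            f = self.proj(self.cnn(img))        # (B, d, H', W') spatial features
            f = f.flatten(2).transpose(1, 2)    # (B, H'*W', d) patch tokens
            feats.append(f + self.time_emb[t])  # tag tokens with their time step
        return self.fuse(torch.cat(feats, dim=1))  # joint spatio-temporal tokens
```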
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
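"Separated attention spaces" admits a simple reading: run one attention module over the vision tokens and another over the language tokens, then merge the two contexts. The sketch below implements that reading; the concatenate-and-project merge is an assumption, not DiMBERT's published layer.

```python
import torch
import torch.nn as nn

class DisentangledMultimodalAttention(nn.Module):
    """Separate attention spaces for vision and language, one plausible
    reading of DiMBERT's disentangled attention."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_l = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, queries, vis_tokens, lang_tokens):
        # queries attend to each modality in its own attention space ...
        ctx_v, _ = self.attn_v(queries, vis_tokens, vis_tokens)
        ctx_l, _ = self.attn_l(queries, lang_tokens, lang_tokens)
        # ... and the two contexts are merged into one representation
        return self.merge(torch.cat([ctx_v, ctx_l], dim=-1))
```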
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
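The masking step itself is easy to sketch: sample random subsets of image patches and text tokens, corrupt them, and keep boolean masks so reconstruction losses (such as the MPMA sketch above) are computed only on masked positions. The ratios and mask token id below are conventional assumptions (MAE-style 75% for images, BERT-style 15% for text), not the paper's settings.

```python
import torch

def random_mask(patches, tokens, mask_token_id: int = 103,
                img_ratio: float = 0.75, txt_ratio: float = 0.15):
    """Randomly mask image patches and text tokens for joint reconstruction.

    patches: (B, N, P) patch pixels; tokens: (B, T) token ids.
    Returns corrupted inputs plus boolean masks marking what to reconstruct.
    """
    B, N, _ = patches.shape
    T = tokens.size(1)
    img_mask = torch.rand(B, N) < img_ratio        # which patches to hide
    txt_mask = torch.rand(B, T) < txt_ratio        # which tokens to hide
    corrupted_patches = patches.clone()
    corrupted_patches[img_mask] = 0.0              # zero out masked patches
    corrupted_tokens = tokens.clone()
    corrupted_tokens[txt_mask] = mask_token_id     # BERT-style [MASK] id assumed
    return corrupted_patches, corrupted_tokens, img_mask, txt_mask
```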
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.