UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic
Cross-modal Learnable Prompts
- URL: http://arxiv.org/abs/2312.11171v1
- Date: Mon, 18 Dec 2023 13:18:24 GMT
- Title: UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic
Cross-modal Learnable Prompts
- Authors: Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, Hongwei Wang
- Abstract summary: We propose UniDCP, a Unified medical vision-language model with Dynamic Cross-modal learnable Prompts.
UniDCP is capable of performing all 8 medical uni-modal and cross-modal tasks over 14 corresponding datasets.
- Score: 14.681493967465693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical vision-language pre-training (Med-VLP) models have recently
accelerated progress in fast-growing medical diagnostic applications. However, most
Med-VLP models learn task-specific representations independently from scratch,
which makes them highly inflexible when working across multiple fine-tuning
tasks. In this work, we propose UniDCP, a Unified medical vision-language model
with Dynamic Cross-modal learnable Prompts, which can be flexibly applied to
multiple medical vision-language tasks. Specifically, we explicitly construct a
unified framework that harmonizes diverse inputs from multiple pretraining tasks
by leveraging cross-modal prompts for unification, and can accordingly
accommodate heterogeneous medical fine-tuning tasks. Furthermore, we devise a
dynamic cross-modal prompt optimization strategy that optimizes the prompts
within a shareable space to implicitly capture shareable clinical knowledge.
UniDCP is the first Med-VLP model capable of performing all 8 medical uni-modal
and cross-modal tasks over 14 corresponding datasets, consistently yielding
superior results over diverse state-of-the-art methods.
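To make the core mechanism concrete, below is a minimal PyTorch sketch of cross-modal learnable prompts: a small pool of shared, trainable prompt vectors is prepended to the concatenated image/text token sequence before a standard Transformer encoder. All module choices, names, and dimensions here are illustrative assumptions, not details taken from the UniDCP paper.

```python
# Minimal sketch of cross-modal learnable prompts (illustrative only,
# not the authors' implementation): a shared pool of learnable prompt
# vectors is prepended to the fused image/text token sequence.
import torch
import torch.nn as nn


class CrossModalPromptEncoder(nn.Module):
    def __init__(self, dim=256, num_prompts=8, num_layers=2, num_heads=4):
        super().__init__()
        # Shareable prompt pool: learnable vectors reused across tasks.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim); text_tokens: (B, N_txt, dim)
        batch = image_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the shared prompts to the concatenated multimodal sequence.
        fused = torch.cat([prompts, image_tokens, text_tokens], dim=1)
        return self.encoder(fused)


if __name__ == "__main__":
    model = CrossModalPromptEncoder()
    img = torch.randn(2, 49, 256)  # e.g. patch features from a vision backbone
    txt = torch.randn(2, 32, 256)  # e.g. token features from a text encoder
    out = model(img, txt)
    print(out.shape)  # torch.Size([2, 89, 256]) = 8 prompts + 49 + 32 tokens
```

Because the prompt pool is a standalone shared parameter, a downstream task could in principle adapt it (or a subset of it) while leaving the backbone largely untouched, which is one way to realize the kind of cross-task flexibility the abstract describes.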
Related papers
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z) - FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction [34.732561455987145]
We propose a unified healthcare prediction model, named FlexCare, to flexibly accommodate incomplete multimodal inputs.
A task-agnostic multimodal information extraction module is presented to capture decorrelated representations of diverse intra- and inter-modality patterns.
Experimental results on multiple tasks from MIMIC-IV/MIMIC-CXR/MIMIC-NOTE datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-17T12:03:10Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - A Generalist Learner for Multifaceted Medical Image Interpretation [14.75683710779724]
We propose MedVersa, a generalist learner that enables flexible learning and tasking for medical image interpretation.
By leveraging a large language model as a learnable orchestrator, MedVersa can learn from both visual and linguistic supervision, support multimodal inputs, and perform real-time task specification.
Our experiments demonstrate that MedVersa achieves state-of-the-art performance in 9 tasks, sometimes outperforming specialist counterparts by over 10%.
arXiv Detail & Related papers (2024-05-13T17:58:51Z) - Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models [17.643421997037514]
We propose a novel framework that tackles both discriminative and generative multimodal medical tasks.
The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning.
Our model can achieve performance superior to or on par with state-of-the-art baselines.
arXiv Detail & Related papers (2024-04-16T02:35:17Z) - MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z) - Multi-task Paired Masking with Alignment Modeling for Medical
Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z) - Towards Medical Artificial General Intelligence via Knowledge-Enhanced
Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knowledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Multi-modal Understanding and Generation for Medical Images and Text via
Vision-Language Pre-Training [5.119201893752376]
We propose Medical Vision Language Learner (MedViLL) which adopts a Transformer-based architecture combined with a novel multimodal attention masking scheme.
We empirically demonstrate the superior downstream task performance of MedViLL against various baselines including task-specific architectures.
arXiv Detail & Related papers (2021-05-24T15:14:09Z) - Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities.
This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time.
We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)
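For the prompt-based unification idea in the PTUnifier entry above, the following is one plausible illustrative reading (hypothetical names and shapes, not the paper's code): learnable visual and textual prompt banks stand in for whichever modality is missing, so image-only, text-only, and image-text inputs share a single input format.

```python
# Illustrative sketch of a prompt "feature bank" in the spirit of PTUnifier
# (hypothetical shapes and names, not the paper's code): learnable visual and
# textual prompt banks substitute for a missing modality.
import torch
import torch.nn as nn


class PromptFeatureBank(nn.Module):
    def __init__(self, dim=256, bank_size=16):
        super().__init__()
        self.visual_bank = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
        self.textual_bank = nn.Parameter(torch.randn(bank_size, dim) * 0.02)

    def forward(self, image_tokens=None, text_tokens=None):
        # Substitute the corresponding prompt bank for any missing modality,
        # so downstream encoders always receive both "modalities".
        assert image_tokens is not None or text_tokens is not None
        ref = image_tokens if image_tokens is not None else text_tokens
        batch = ref.size(0)
        if image_tokens is None:
            image_tokens = self.visual_bank.unsqueeze(0).expand(batch, -1, -1)
        if text_tokens is None:
            text_tokens = self.textual_bank.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([image_tokens, text_tokens], dim=1)


if __name__ == "__main__":
    bank = PromptFeatureBank()
    txt_only = bank(text_tokens=torch.randn(2, 32, 256))
    print(txt_only.shape)  # torch.Size([2, 48, 256]) = 16 visual prompts + 32 text tokens
```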
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.