Open-Ended Medical Visual Question Answering Through Prefix Tuning of
Language Models
- URL: http://arxiv.org/abs/2303.05977v2
- Date: Fri, 21 Jul 2023 22:33:49 GMT
- Title: Open-Ended Medical Visual Question Answering Through Prefix Tuning of
Language Models
- Authors: Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees
G. M. Snoek and Marcel Worring
- Abstract summary: We focus on open-ended VQA and, motivated by recent advances in language models, consider it a generative task.
To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens.
We evaluate our approach on the prime medical VQA benchmarks, namely, Slake, OVQA and PathVQA.
- Score: 42.360431316298204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Visual Question Answering (VQA) is an important challenge, as it
would lead to faster and more accurate diagnoses and treatment decisions. Most
existing methods approach it as a multi-class classification problem, which
restricts the outcome to a predefined closed set of curated answers. We focus
on open-ended VQA and, motivated by recent advances in language models,
consider it a generative task. Leveraging pre-trained language models, we
introduce a novel method particularly suited for small, domain-specific,
medical datasets. To properly communicate the medical images to the language
model, we develop a network that maps the extracted visual features to a set of
learnable tokens. Then, alongside the question, these learnable tokens directly
prompt the language model. We explore recent parameter-efficient fine-tuning
strategies for language models, which allow for resource- and data-efficient
fine-tuning. We evaluate our approach on the prime medical VQA benchmarks,
namely, Slake, OVQA and PathVQA. The results demonstrate that our approach
outperforms existing methods across various training settings while also being
computationally efficient.
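To make the described pipeline concrete, here is a minimal sketch of the idea: extracted visual features are mapped to a short sequence of learnable prefix tokens that, together with the embedded question, prompt a frozen language model. This is an illustrative, assumption-laden example (PyTorch, a Hugging Face-style language model exposing get_input_embeddings(); the class name, dimensions, and MLP design are hypothetical), not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VisualPrefixMapper(nn.Module):
    """Hypothetical mapping network: expands a pooled visual feature vector
    into a short sequence of prefix embeddings in the language model's
    embedding space. Dimensions and the two-layer MLP are assumptions."""

    def __init__(self, visual_dim: int = 512, lm_dim: int = 768, prefix_len: int = 8):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mapper = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * prefix_len),
            nn.GELU(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # (batch, visual_dim) -> (batch, prefix_len, lm_dim)
        batch = visual_features.size(0)
        return self.mapper(visual_features).view(batch, self.prefix_len, self.lm_dim)


def build_prompt(mapper, language_model, visual_features, question_ids):
    """Concatenate the visual prefix with the embedded question tokens so the
    (frozen) language model can generate an open-ended answer."""
    prefix = mapper(visual_features)                                    # (B, P, D)
    question_emb = language_model.get_input_embeddings()(question_ids)  # (B, Q, D)
    return torch.cat([prefix, question_emb], dim=1)                     # (B, P+Q, D)


# Parameter-efficient setup: freeze the language model and train only the mapper
# (optionally together with lightweight adapters or prefix parameters).
# for p in language_model.parameters():
#     p.requires_grad = False
```

Under this kind of setup only the small mapping network (and any parameter-efficient modules inside the language model) is updated, which is consistent with the resource- and data-efficient fine-tuning described above.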
Related papers
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings [3.944219308229571]
In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is the task of answering a question based on a given context.
Modern language models such as BioBERT, SciBERT and even ChatGPT are trained on vast amounts of in-domain medical corpora.
We propose a resource-efficient approach for injecting domain knowledge into a model without relying on such domain-specific pre-training.
arXiv Detail & Related papers (2024-01-15T21:43:46Z)
- MISS: A Generative Pretraining and Finetuning Approach for Med-VQA [16.978523518972533]
We propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks.
We unify the text encoder and multimodal encoder and align image-text features through multi-task learning.
Our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
arXiv Detail & Related papers (2024-01-10T13:56:40Z)
- Visual Question Answering in the Medical Domain [13.673890873313354]
We present a novel contrastive learning pretraining method to mitigate the problem of small datasets for the Med-VQA task.
Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
arXiv Detail & Related papers (2023-09-20T06:06:10Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss [73.65872901950135]
This work attempts to tackle the language prior problem from the viewpoint of feature space learning.
An adapted margin cosine loss is designed to discriminate between the frequent and the sparse answers in feature space.
Experimental results demonstrate that our adapted margin cosine loss can greatly enhance the baseline models.
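For context, a generic large-margin cosine loss of the kind AdaVQA adapts can be sketched as follows; this is a CosFace-style illustration with assumed hyperparameters, not the paper's frequency-adaptive margin formulation.

```python
import torch.nn.functional as F


def margin_cosine_loss(features, class_weights, labels, scale=30.0, margin=0.35):
    """Generic large-margin cosine loss, shown only to illustrate the loss
    family AdaVQA adapts; the paper's answer-frequency-dependent margin is
    not reproduced here.

    features: (B, D) answer-space features; class_weights: (C, D) per-answer
    prototypes; labels: (B,) ground-truth answer indices."""
    # Cosine similarity between L2-normalised features and answer prototypes.
    cos = F.linear(F.normalize(features), F.normalize(class_weights))  # (B, C)
    # Subtract the margin only from the ground-truth answer's logit.
    one_hot = F.one_hot(labels, num_classes=cos.size(1)).float()
    logits = scale * (cos - margin * one_hot)
    return F.cross_entropy(logits, labels)
```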
arXiv Detail & Related papers (2021-05-05T11:41:38Z)
- MMBERT: Multimodal BERT Pretraining for Improved Medical VQA [23.78515287446131]
We propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision and Language tasks.
Our method involves learning richer medical image and text semantic representations using Masked Language Modeling.
The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images.
arXiv Detail & Related papers (2021-04-03T13:01:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.