WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models
- URL: http://arxiv.org/abs/2404.14567v1
- Date: Mon, 22 Apr 2024 20:29:58 GMT
- Title: WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models
- Authors: Ronald Xie, Steven Palayew, Augustin Toma, Gary Bader, Bo Wang
- Abstract summary: We report results for two standalone solutions under the English category of the task.
We identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
- Score: 5.7931394318054155
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
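Of the two solutions described in the abstract, the second trains an image-disease label joint embedding in the style of CLIP and uses it for image classification. The sketch below illustrates what such a setup can look like in general; it is not the authors' implementation, and the encoder modules, embedding dimension, and function names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of a CLIP-style joint embedding
# between images and disease-label text, trained with a symmetric contrastive loss
# and used for classification over a fixed label set.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a CNN or ViT backbone (assumed)
        self.text_encoder = text_encoder     # e.g. a small transformer over label text (assumed)
        # Learnable temperature stored in log-space, CLIP-style initialization.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, label_tokens):
        img = F.normalize(self.image_encoder(images), dim=-1)       # (B, D)
        txt = F.normalize(self.text_encoder(label_tokens), dim=-1)  # (B, D)
        return self.logit_scale.exp() * img @ txt.t()               # (B, B) similarities

def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    # Symmetric cross-entropy: matched image/label pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def classify(model: JointEmbedding, images, all_label_tokens):
    # At inference, score each image against every disease-label embedding
    # and return the index of the highest-similarity label.
    img = F.normalize(model.image_encoder(images), dim=-1)
    txt = F.normalize(model.text_encoder(all_label_tokens), dim=-1)
    return (img @ txt.t()).argmax(dim=-1)
```

In this kind of setup the fixed set of disease labels is encoded once at inference time, and each image is assigned the label with the highest cosine similarity.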
Related papers
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - Cultivating Multimodal Intelligence: Interpretive Reasoning and Agentic RAG Approaches to Dermatological Diagnosis [0.0]
The second edition of the 2025 ImageCLEF MEDIQA-MAGIC challenge focuses on multimodal dermatology question answering and segmentation. This work addresses the Closed Visual Question Answering (CVQA) task, where the goal is to select the correct answer to multiple-choice clinical questions. The team achieved second place with a submission that scored sixth, demonstrating competitive performance and high accuracy.
arXiv Detail & Related papers (2025-07-07T22:31:56Z) - MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models [48.24824129683951]
We introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks.
arXiv Detail & Related papers (2025-06-12T08:13:38Z) - ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning [62.61187785810336]
ImageScope is a training-free, three-stage framework that unifies language-guided image retrieval tasks.
In the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity.
In the second and third stages, we reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally.
arXiv Detail & Related papers (2025-03-13T08:43:24Z) - LIMIS: Towards Language-based Interactive Medical Image Segmentation [58.553786162527686]
LIMIS is the first purely language-based interactive medical image segmentation model.
We adapt Grounded SAM to the medical domain and design a language-based model interaction strategy.
We evaluate LIMIS on three publicly available medical datasets in terms of performance and usability.
arXiv Detail & Related papers (2024-10-22T12:13:47Z) - Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE [17.94158825878658]
Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks.
Uni-Med is a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM.
To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector in MLLMs.
arXiv Detail & Related papers (2024-09-26T03:33:26Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning [16.03318708001763]
We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain.
Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features.
arXiv Detail & Related papers (2024-06-11T16:55:38Z) - MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning [0.0]
This paper addresses the limitations of traditional methods by proposing a weakly supervised learning approach for open-ended medical question-answering (QA).
Our system leverages readily available MEDIQA-M3G images via a VGG16-CNN-SVM model, enabling multilingual learning of informative skin condition representations.
This work advances medical QA research, paving the way for clinical decision support systems and ultimately improving healthcare delivery.
arXiv Detail & Related papers (2024-04-27T20:03:47Z) - CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training (CLIP) successfully introduces text supervision to vision models.
It has shown promising results across various tasks, attributable to its generalizability and interpretability.
The use of CLIP has recently attracted increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z) - Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z) - Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z) - Contrastive Semi-Supervised Learning for 2D Medical Image Segmentation [16.517086214275654]
We present a novel semi-supervised 2D medical segmentation solution that applies Contrastive Learning (CL) on image patches, instead of full images.
These patches are meaningfully constructed using the semantic information of different classes obtained via pseudo labeling.
We also propose a novel consistency regularization scheme, which works in synergy with contrastive learning.
arXiv Detail & Related papers (2021-06-12T15:43:24Z) - Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis [102.40869566439514]
We seek to exploit rich labeled data from relevant domains to aid learning in the target task via Unsupervised Domain Adaptation (UDA).
Unlike most UDA methods that rely on clean labeled data or assume samples are equally transferable, we innovatively propose a Collaborative Unsupervised Domain Adaptation algorithm.
We theoretically analyze the generalization performance of the proposed method, and also empirically evaluate it on both medical and general images.
arXiv Detail & Related papers (2020-07-05T11:49:17Z)