ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
- URL: http://arxiv.org/abs/2409.17610v2
- Date: Tue, 29 Oct 2024 18:48:01 GMT
- Title: ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
- Authors: Zhangpu Li, Changhong Zou, Suxue Ma, Zhicheng Yang, Chen Du, Youbao Tang, Zhenjie Cao, Ning Zhang, Jui-Hsin Lai, Ruei-Sung Lin, Yuan Ni, Xingzhi Sun, Jing Xiao, Jieke Hou, Kai Zhang, Mei Han
- Abstract summary: In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition.
Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones.
We propose ZALM3, a Zero-shot strategy to improve vision-language alignment in Multi-turn Multimodal Medical dialogue.
- Score: 25.398370966763597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid rise of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike the high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones. These images have poor quality control, with issues such as excessive background elements and lesion areas that are significantly off-center, leading to degraded vision-language alignment during model training. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the text conversation preceding an image can indicate the regions of interest (RoIs) in that image, ZALM3 employs an LLM to summarize keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate the proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue that enables fine-grained performance comparison. Our experiments across three clinical departments demonstrate the efficacy of ZALM3 with statistical significance.
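Below is a minimal, hypothetical sketch of the zero-shot RoI-refinement step described in the abstract: keywords are summarized from the preceding dialogue by an LLM, grounded to boxes by a visual grounding model, and the patient photo is cropped to the enclosing region. The function and parameter names (`refine_image_with_context`, `summarize_keywords`, `ground_keywords`) are illustrative assumptions, not taken from the paper; the LLM and grounding model are injected as callables.

```python
# Illustrative sketch of a ZALM3-style zero-shot RoI refinement step.
# `summarize_keywords` and `ground_keywords` stand in for an off-the-shelf
# LLM and visual grounding model; their names and signatures are assumptions.
from typing import Callable, List, Tuple
from PIL import Image

Box = Tuple[float, float, float, float]  # (left, upper, right, lower) in pixels


def refine_image_with_context(
    preceding_turns: List[str],
    image: Image.Image,
    summarize_keywords: Callable[[str], List[str]],                  # LLM: dialogue -> keywords
    ground_keywords: Callable[[Image.Image, List[str]], List[Box]],  # grounding: keywords -> boxes
) -> Image.Image:
    """Crop the patient photo to the region(s) inferred from prior dialogue turns."""
    context = "\n".join(preceding_turns)
    keywords = summarize_keywords(context)       # e.g. ["rash", "left forearm"]
    boxes = ground_keywords(image, keywords)     # candidate regions of interest
    if not boxes:                                # nothing grounded: keep the original image
        return image
    # Merge all candidate boxes into one enclosing crop to drop background noise.
    left = min(b[0] for b in boxes)
    upper = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    lower = max(b[3] for b in boxes)
    return image.crop((left, upper, right, lower))
```

The cropped image would then replace the original in the multi-turn training sample, so the vision encoder sees less background and a more centered lesion region.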
Related papers
- LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound [7.941670191244354]
We propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning.
Specifically, we devise a fusion module with fine-grained vision encoders to achieve enhancement for subtle medical visual semantics.
In practice, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from the hospital.
arXiv Detail & Related papers (2024-10-19T11:38:31Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach organizes various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- M3T: Multi-Modal Medical Transformer to bridge Clinical Context with Visual Insights for Retinal Image Medical Description Generation [1.8789068567093286]
Multi-Modal Medical Transformer (M3T) is a novel deep learning architecture that integrates visual representations with diagnostic keywords.
Experimental studies on the DeepEyeNet dataset validate the success of M3T in meeting ophthalmologists' standards.
arXiv Detail & Related papers (2024-06-19T00:46:48Z)
- QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge [93.61262892578067]
Uncertainty in medical image segmentation tasks, especially inter-rater variability, presents a significant challenge.
This variability directly impacts the development and evaluation of automated segmentation algorithms.
We report the setup and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ).
arXiv Detail & Related papers (2024-03-19T17:57:24Z)
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space [39.61770813855078]
We propose a Unified Medical Image Pre-training framework, namely UniMedI.
UniMedI uses diagnostic reports as common semantic space to create unified representations for diverse modalities of medical images.
We evaluate its performance on both 2D and 3D images across 10 different datasets.
arXiv Detail & Related papers (2023-11-24T22:01:12Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation [37.93699188912036]
We present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM).
GTGM generates medical-style text from 3D medical images without relying on paired descriptions.
A negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches.
arXiv Detail & Related papers (2023-06-07T22:20:51Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning, using knowledge as a supplement to the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)