Related papers: A Survey of Medical Vision-and-Language Applications and Their Techniques

URL: http://arxiv.org/abs/2411.12195v1
Date: Tue, 19 Nov 2024 03:27:05 GMT
Title: A Survey of Medical Vision-and-Language Applications and Their Techniques
Authors: Qi Chen, Ruoshan Zhao, Sinuo Wang, Vu Minh Hieu Phan, Anton van den Hengel, Johan Verjans, Zhibin Liao, Minh-Son To, Yong Xia, Jian Chen, Yutong Xie, Qi Wu
Abstract summary: Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
Score: 48.268198631277315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration/exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and codes is available at: https://github.com/YtongXie/Medical-Vision-and-Language-Tasks-and-Methodologies-A-Survey.

Related papers

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
UMIT: Unifying Medical Imaging Tasks via Vision-Language Models [17.65946656129399]
UMIT is a unified multi-modal, multi-task VLM designed specifically for medical imaging tasks. It is able to solve various tasks, including visual question answering, disease detection, and medical report generation. It supports both English and Chinese, expanding its applicability globally.
arXiv Detail & Related papers (2025-03-20T06:43:36Z)
Large Language Model Benchmarks in Medical Tasks [11.196196955468992]
This paper presents a survey of various benchmark datasets employed in medical large language models (LLMs) tasks. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs. The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z)
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports. Based on this dataset, we focus on the challenging task of unsupervised pretraining. We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice [12.390859712280328]
Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research.
arXiv Detail & Related papers (2024-09-14T02:35:29Z)
Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed. In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset. We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features. We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review [0.0]
Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze medical data. Our paper reviews recent advancements in developing models designed for medical report generation and visual question answering.
arXiv Detail & Related papers (2024-03-04T20:29:51Z)
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering [25.569980942498347]
ChatGPT explores a strategic blueprint of question answering (QA) in delivering medical diagnosis, treatment recommendations, and other healthcare support. This is achieved through the increasing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms.
arXiv Detail & Related papers (2024-01-15T07:21:16Z)
CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare [16.033112094191395]
We introduce the Multimodal Medical Question Summarization (MMQS) dataset. This dataset pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework, consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries.
arXiv Detail & Related papers (2023-12-16T03:02:05Z)
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance structured medical knowledge from three perspectives. First, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text. Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)
Cross-Modal Information Maximization for Medical Imaging: CMIM [62.28852442561818]
In hospitals, data are siloed to specific information systems that make the same information available under different modalities. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. We propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time.
arXiv Detail & Related papers (2020-10-20T20:05:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.