Related papers: Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

URL: http://arxiv.org/abs/2403.02469v2
Date: Mon, 15 Apr 2024 13:51:30 GMT
Title: Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review
Authors: Iryna Hartsock, Ghulam Rasool
Abstract summary: Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze medical data. Our paper reviews recent advancements in developing models designed for medical report generation and visual question answering.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs to enable learning from multimodal data. Key areas we address include the exploration of medical vision-language datasets, in-depth analyses of architectures and pre-training strategies employed in recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for assessing VLMs' performance in medical report generation and VQA. We also highlight current challenges and propose future directions, including enhancing clinical validity and addressing patient privacy concerns. Overall, our review summarizes recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.

Related papers

Vision Language Models in Medicine [3.964982657945488]
Medical Vision-Language Models (Med-VLMs) integrate visual and textual data to enhance healthcare outcomes. The transformative impact of Med-VLMs on clinical practice, education, and patient care is highlighted. Challenges such as data scarcity, narrow task generalization, interpretability issues, and ethical concerns including fairness, accountability, and privacy are also highlighted. Future directions include leveraging large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability.
arXiv Detail & Related papers (2025-02-24T22:53:22Z)
A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z)
From Text to Multimodality: Exploring the Evolution and Impact of Large Language Models in Medical Practice [12.390859712280328]
Large Language Models (LLMs) have rapidly evolved from text-based systems to multimodal platforms. We examine the current landscape of MLLMs in healthcare, analyzing their applications across clinical decision support, medical imaging, patient engagement, and research.
arXiv Detail & Related papers (2024-09-14T02:35:29Z)
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions. VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
Evaluating large language models in medical applications: a survey [1.5923327069574245]
Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains. Evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information.
arXiv Detail & Related papers (2024-05-13T05:08:33Z)
Medical Vision Language Pretraining: A survey [8.393439175704124]
Medical Vision Language Pretraining is a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired/unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations.
arXiv Detail & Related papers (2023-12-11T09:14:13Z)
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark [12.565598914787834]
We propose RadioGraphy Captions (RGC), a high-quality, multi-modality radiographic dataset containing 18,434 image-caption pairs. RGC can be used as a pre-training dataset or a new benchmark for medical report generation and medical image-text retrieval.
arXiv Detail & Related papers (2023-06-10T17:27:33Z)
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [56.25766322554655]
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery. We propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef 2019.
arXiv Detail & Related papers (2023-05-17T17:50:16Z)
Privacy-preserving machine learning for healthcare: open challenges and future perspectives [72.43506759789861]
We conduct a review of recent literature concerning Privacy-Preserving Machine Learning (PPML) for healthcare. We primarily focus on privacy-preserving training and inference-as-a-service. The aim of this review is to guide the development of private and efficient ML models in healthcare.
arXiv Detail & Related papers (2023-03-27T19:20:51Z)
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives. First, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model so that it can reason with knowledge as a supplement to the input image and text. Third, we guide the model to emphasize the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.