Training Medical Large Vision-Language Models with Abnormal-Aware Feedback
- URL: http://arxiv.org/abs/2501.01377v1
- Date: Thu, 02 Jan 2025 17:37:20 GMT
- Title: Training Medical Large Vision-Language Models with Abnormal-Aware Feedback
- Authors: Yucheng Zhou, Lingran Song, Jianbing Shen
- Abstract summary: We propose a novel UMed-LVLM designed for Unveiling Medical abnormalities. We propose a prompt method utilizing GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Experimental results demonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormalities.
- Score: 57.98393950821579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Medical Large Vision-Language Models (Med-LVLMs), which encapsulate extensive medical knowledge, demonstrate excellent capabilities in understanding medical images and responding to human queries based on these images. However, there remain challenges in visual localization in medical images, which is crucial for abnormality detection and interpretation. To address these issues, we propose UMed-LVLM, a novel model designed for Unveiling Medical abnormalities. Specifically, we collect a Medical Abnormalities Unveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM training. To collect the MAU dataset, we propose a prompt method utilizing GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Moreover, the two-stage training method includes Abnormal-Aware Instruction Tuning and Abnormal-Aware Rewarding, comprising Abnormal Localization Rewarding and Vision Relevance Rewarding. Experimental results demonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormalities. In addition, this work shows that enhancing the abnormality detection capabilities of Med-LVLMs significantly improves their understanding of medical images and generalization capability.
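The abstract names the two reward components (Abnormal Localization Rewarding and Vision Relevance Rewarding) but does not give their formulation. Below is a minimal sketch of how such a combined reward could be computed, assuming an IoU-based localization term and a cosine-similarity relevance term; the function names, weights, and signatures are hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def abnormal_aware_reward(pred_box, gt_box, answer_emb, image_emb,
                          w_loc=0.5, w_rel=0.5):
    """Hypothetical combined reward: a localization term (assumed IoU between
    the predicted and annotated abnormal regions) plus a vision-relevance term
    (assumed cosine similarity between the generated diagnosis embedding and
    the image embedding). Weights w_loc and w_rel are placeholders."""
    r_loc = iou(pred_box, gt_box)          # Abnormal Localization Rewarding (assumed)
    r_rel = cosine(answer_emb, image_emb)  # Vision Relevance Rewarding (assumed)
    return w_loc * r_loc + w_rel * r_rel
```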
Related papers
- MedM-VL: What Makes a Good Medical LVLM? [17.94998411263113]
Large vision-language models (LVLMs) offer new solutions for solving complex medical tasks.
We build on the popular LLaVA framework to explore model architectures and training strategies for both 2D and 3D medical LVLMs.
We release a modular codebase, MedM-VL, and two pre-trained models: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications.
arXiv Detail & Related papers (2025-04-06T01:44:46Z) - Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions [11.503540826701807]
We introduce a novel approach to enhance VLM performance in medical abnormality detection and localization.
We focus on breaking down medical concepts into fundamental attributes and common visual patterns.
We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs.
arXiv Detail & Related papers (2025-03-05T09:02:33Z) - MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context [21.562034852024272]
Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data.
Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets.
We introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs.
arXiv Detail & Related papers (2024-07-03T00:59:03Z) - CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation [20.59298361626719]
We propose a chain-of-medical-thought approach (CoMT) to mitigate hallucinations in medical report generation.
CoMT intends to imitate the cognitive process of human doctors by decomposing diagnostic procedures.
arXiv Detail & Related papers (2024-06-17T12:03:32Z) - Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z) - MediCLIP: Adapting CLIP for Few-shot Medical Image Anomaly Detection [6.812281925604158]
This paper first focuses on the task of medical image anomaly detection in the few-shot setting.
We propose an innovative approach, MediCLIP, which adapts the CLIP model to few-shot medical image anomaly detection through self-supervised fine-tuning.
arXiv Detail & Related papers (2024-05-18T15:24:58Z) - OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z) - Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z)