Related papers: Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs

URL: http://arxiv.org/abs/2510.15418v2
Date: Fri, 07 Nov 2025 05:24:14 GMT
Title: Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
Authors: Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye,
Abstract summary: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines.<n>This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions.
Score: 0.9558392439655014
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the models ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.

Related papers

CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework [29.22693846221723]
We introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework.<n> CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination.<n>Our CARE-Flow improves average accuracy by 10.9% over the same size (10B) state-of-the-art (SOTA)
arXiv Detail & Related papers (2026-03-02T08:38:37Z)
CLEAR-Mamba:Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification [18.95392587947337]
Medical image classification is a core task in computer-aided diagnosis (CAD)<n>In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information.<n>We propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy.
arXiv Detail & Related papers (2026-01-28T13:40:10Z)
MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks with complex clinical reasoning required in real-world scenarios.<n>We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z)
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks [21.203358914772465]
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear.<n>We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark designed to probe the limits of multimodal clinical reasoning in neurology.
arXiv Detail & Related papers (2025-09-26T12:20:01Z)
Evaluating Large Language Models for Evidence-Based Clinical Question Answering [4.101088122511548]
Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications.<n>We curate a benchmark drawing from Cochrane systematic reviews and clinical guidelines.<n>We observe consistent performance patterns across sources and clinical domains.
arXiv Detail & Related papers (2025-09-13T15:03:34Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines [16.56254046507092]
We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative guidelines.<n>Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content.<n>A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence.
arXiv Detail & Related papers (2025-06-22T11:31:13Z)
GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images.<n>Current methods still suffer from limited answer reliability and poor interpretability.<n>This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
Metrics that matter: Evaluating image quality metrics for medical image generation [48.85783422900129]
This study comprehensively assesses commonly used no-reference image quality metrics using brain MRI data.<n>We evaluate metric sensitivity to a range of challenges, including noise, distribution shifts, and, critically, morphological alterations designed to mimic clinically relevant inaccuracies.
arXiv Detail & Related papers (2025-05-12T01:57:25Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.<n>Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.<n>We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
FODA-PG for Enhanced Medical Imaging Narrative Generation: Adaptive Differentiation of Normal and Abnormal Attributes [26.912139217120874]
We propose FODA-PG, a novel Fine-grained Organ-Disease Adaptive Partitioning Graph framework. FODA-PG constructs a granular representation of radiological findings by separating disease-related attributes into distinct "disease-specific" and "disease-free" categories. By integrating this fine-grained semantic knowledge into a powerful transformer-based architecture, FODA-PG generates precise and clinically coherent reports.
arXiv Detail & Related papers (2024-09-06T00:04:35Z)
PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Our approach fuses image and textual data to enhance the generation process. We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.