MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
- URL: http://arxiv.org/abs/2508.07022v1
- Date: Sat, 09 Aug 2025 15:36:08 GMT
- Title: MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
- Authors: Shengtao Wen, Haodong Chen, Yadong Wang, Zhongying Pan, Xiang Chen, Yu Tian, Bo Qian, Dong Liang, Sheng-Jun Huang
- Abstract summary: Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. We propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons.
- Score: 31.344312340552495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge editing (KE) provides a scalable approach for updating factual knowledge in large language models without full retraining. While previous studies have demonstrated effectiveness in general domains and medical QA tasks, little attention has been paid to KE in multimodal medical scenarios. Unlike text-only settings, medical KE demands integrating updated knowledge with visual reasoning to support safe and interpretable clinical decisions. To address this gap, we propose MultiMedEdit, the first benchmark tailored to evaluating KE in clinical multimodal tasks. Our framework spans both understanding and reasoning task types, defines a three-dimensional metric suite (reliability, generality, and locality), and supports cross-paradigm comparisons across general and domain-specific models. We conduct extensive experiments under single-editing and lifelong-editing settings. Results suggest that current methods struggle with generalization and long-tail reasoning, particularly in complex clinical workflows. We further present an efficiency analysis (e.g., edit latency, memory footprint), revealing practical trade-offs in real-world deployment across KE paradigms. Overall, MultiMedEdit not only reveals the limitations of current approaches but also provides a solid foundation for developing clinically robust knowledge editing techniques in the future.
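The abstract's three-dimensional metric suite (reliability, generality, locality) follows the standard KE evaluation recipe. The sketch below is illustrative only, not the authors' code: the probe-set names and the dict-based `edit` structure are assumptions made for demonstration.

```python
# Hedged sketch of scoring one knowledge edit along the three common
# KE dimensions. All field names here are illustrative assumptions.

def accuracy(preds, golds):
    """Fraction of predictions exactly matching the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def score_edit(model, edit):
    """Score a single edit; `model` maps a question to an answer.

    `edit` is assumed to carry three probe sets:
      - target_qs/target_as:     the edited fact itself (reliability)
      - rephrase_qs/rephrase_as: paraphrased probes of that fact (generality)
      - unrelated_qs/pre_edit_as: probes whose answers must NOT change (locality)
    """
    reliability = accuracy([model(q) for q in edit["target_qs"]], edit["target_as"])
    generality = accuracy([model(q) for q in edit["rephrase_qs"]], edit["rephrase_as"])
    # Locality: agreement with the pre-edit model's answers on unrelated probes.
    locality = accuracy([model(q) for q in edit["unrelated_qs"]], edit["pre_edit_as"])
    return {"reliability": reliability, "generality": generality, "locality": locality}
```

In a multimodal setting the probes would be (image, question) pairs rather than bare questions, but the three-way decomposition is the same.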
Related papers
- MedConsultBench: A Full-Cycle, Fine-Grained, Process-Aware Benchmark for Medical Consultation Agents [10.109613967215447]
We propose MedConsultBench, a comprehensive framework designed to evaluate the complete online consultation cycle. Our methodology introduces Atomic Information Units (AIUs) to track clinical information acquisition at a sub-turn level. By addressing the underspecification and ambiguity inherent in online consultations, the benchmark evaluates uncertainty-aware yet concise inquiry.
arXiv Detail & Related papers (2026-01-19T02:18:10Z) - MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs [37.6690828097719]
Large language models (LLMs) show promise for clinical use. Many medical datasets rely on simplified Question-Answering (QA) that underrepresents real-world clinical decision-making. We propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions.
arXiv Detail & Related papers (2025-10-22T20:06:10Z) - MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts [70.64143198545031]
We propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Our results on various medical benchmarks demonstrate that MedREK achieves superior performance across different core metrics.
arXiv Detail & Related papers (2025-10-15T12:50:33Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks. RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - MedMKEB: A Comprehensive Knowledge Editing Benchmark for Medical Multimodal Large Language Models [5.253788190589279]
We present MedMKEB, the first comprehensive benchmark designed to evaluate the reliability, generality, locality, portability, and robustness of knowledge editing. MedMKEB is built on a high-quality medical visual question-answering dataset and enriched with carefully constructed editing tasks. We incorporate human expert validation to ensure the accuracy and reliability of the benchmark.
arXiv Detail & Related papers (2025-08-07T07:09:26Z) - Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z) - Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing [72.8373875453882]
Knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. We propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios.
arXiv Detail & Related papers (2025-06-04T02:14:43Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of the clinical workflow. We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses.
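The case-retrieval step behind a module like MCase-RAG can be sketched as nearest-neighbor lookup over embedded prior cases. This is a generic illustration under assumed data structures (a list of `{"id", "vec"}` dicts), not the MedChain-Agent implementation.

```python
# Minimal sketch of retrieval over a bank of previously seen cases,
# assuming each case carries a precomputed embedding vector "vec".
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_cases(query_vec, case_bank, k=1):
    """Return the k stored cases whose embeddings are closest to the query."""
    ranked = sorted(case_bank, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]
```

The retrieved cases would then be appended to the agent's prompt so it can adapt its response to similar past consultations.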
arXiv Detail & Related papers (2024-12-02T15:25:02Z) - Few Exemplar-Based General Medical Image Segmentation via Domain-Aware Selective Adaptation [28.186785488818135]
Medical image segmentation poses challenges due to domain gaps, data modality variations, and dependency on domain knowledge or experts.
We introduce a domain-aware selective adaptation approach to adapt the general knowledge learned from a large model trained with natural images to the corresponding medical domains/modalities.
arXiv Detail & Related papers (2024-10-11T21:00:57Z)