MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI
- URL: http://arxiv.org/abs/2512.09867v1
- Date: Wed, 10 Dec 2025 17:55:06 GMT
- Title: MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI
- Authors: Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
- Abstract summary: MedForget is a hierarchy-aware multimodal unlearning testbed for building compliant medical AI systems. We show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. We introduce a reconstruction attack that progressively adds hierarchical-level context to prompts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To study this systematically, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, with each hierarchy level having a dedicated unlearning target that reflects a distinct unlearning challenge. Experiments with four state-of-the-art unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical-level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.
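The reconstruction attack described in the abstract can be illustrated with a minimal sketch: build a sequence of probes that each prepend one more level of the Institution -> Patient -> Study -> Section hierarchy to the same question. All names, field values, and the prompt template below are illustrative assumptions, not the authors' actual implementation or API.

```python
# Hypothetical sketch of a hierarchy-progressive reconstruction probe.
# Field values and prompt format are illustrative, not from the paper.

HIERARCHY_LEVELS = ["institution", "patient", "study", "section"]

def build_probe_prompts(record: dict, question: str) -> list[str]:
    """Return probes carrying progressively more hierarchical context."""
    prompts = []
    context_parts = []
    for level in HIERARCHY_LEVELS:
        # Add one more level of context at each step.
        context_parts.append(f"{level.capitalize()}: {record[level]}")
        context = " | ".join(context_parts)
        prompts.append(f"[Context: {context}]\nQuestion: {question}")
    return prompts

record = {
    "institution": "General Hospital A",
    "patient": "Patient 0042",
    "study": "Chest CT 2021-03",
    "section": "Findings",
}
probes = build_probe_prompts(record, "What abnormality is visible?")
```

Each successive probe carries one more level of context; if a supposedly unlearned model answers correctly only once deeper context is supplied, the hierarchical pathway to the forgotten data was not fully deleted, which is the vulnerability the paper reports for fine-grained unlearning.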
Related papers
- EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning [72.70291772077738]
We propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis. We then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Our experiments show that EvoClinician outperforms continual learning baselines and other self-evolving agents, such as memory agents.
arXiv Detail & Related papers (2026-01-30T13:26:18Z)
- MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z)
- TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding [8.188646882370792]
We present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to rethink how models interact with medical texts during training. By categorizing data into difficulty levels and prioritizing simpler cases early in training, TACL builds a strong foundation before tackling more complex records. We observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction, and TCM syndrome differentiation.
arXiv Detail & Related papers (2025-10-17T03:16:51Z)
- Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease [3.46857682956989]
Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering. Most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs. We propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI.
arXiv Detail & Related papers (2025-09-09T11:36:21Z)
- Towards Efficient Prompt-based Continual Learning in Distributed Medical AI [0.13265175299265505]
Modern AI models achieve state-of-the-art performance with large-scale, high-quality datasets. However, ethical, social, and institutional constraints in the medical domain severely restrict data sharing. We propose a prompt-based continual learning (PCL) approach featuring a unified prompt pool with a minimal expansion strategy.
arXiv Detail & Related papers (2025-08-14T06:46:14Z)
- Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models [3.3091869879941687]
We investigate fine-tuning Vision-Language Models (VLMs) for multi-task medical image understanding. We reformulate each task into instruction-based prompts suitable for vision-language reasoning. Results show that multi-task training improves robustness and accuracy.
arXiv Detail & Related papers (2025-05-22T13:18:44Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- DOCTOR: A Multi-Disease Detection Continual Learning Framework Based on Wearable Medical Sensors [3.088223994180069]
We propose DOCTOR, a multi-disease detection continual learning framework based on wearable medical sensors (WMSs).
It employs a multi-headed deep neural network (DNN) and a replay-style CL algorithm.
It achieves 1.43 times better average test accuracy, 1.25 times better F1-score, and 0.41 higher backward transfer than the naive fine-tuning framework.
arXiv Detail & Related papers (2023-05-09T19:33:17Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
- Curriculum learning for improved femur fracture classification: scheduling data with prior knowledge and uncertainty [36.54112505898611]
We propose a method for the automatic classification of proximal femur fractures into 3 and 7 AO classes based on a Convolutional Neural Network (CNN).
Our novel formulation reunites three curriculum strategies: individually weighting training samples, reordering the training set, and sampling subsets of data.
The curriculum improves proximal femur fracture classification up to the performance of experienced trauma surgeons.
arXiv Detail & Related papers (2020-07-31T14:28:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.