MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning
- URL: http://arxiv.org/abs/2510.21093v1
- Date: Fri, 24 Oct 2025 02:11:05 GMT
- Title: MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning
- Authors: Siyong Chen, Jinbo Wen, Jiawen Kang, Tenghui Huang, Xumin Huang, Yuanjia Su, Hudan Pan, Zishao Zhong, Dusit Niyato, Shengli Xie, Dong In Kim
- Abstract summary: We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
- Score: 52.064286116035134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, large models have shown significant potential for smart healthcare. However, the deployment of Large Vision-Language Models (LVLMs) for clinical services is currently hindered by three critical challenges: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. To address these challenges, in this paper, we develop MedAlign, a novel framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). Specifically, we first propose a multimodal Direct Preference Optimization (mDPO) objective to explicitly align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (i.e., an expert), thereby mitigating hallucinations in LVLMs. To achieve adaptive reasoning and facilitate multi-institutional collaboration, we propose a federated governance mechanism, where the selected expert, fine-tuned on clinical datasets based on mDPO, locally performs iterative Chain-of-Thought (CoT) reasoning via the local meta-cognitive uncertainty estimator. Extensive experiments on three representative Med-VQA datasets demonstrate that MedAlign achieves state-of-the-art performance, outperforming strong retrieval-augmented baselines by up to $11.85\%$ in F1-score, and simultaneously reducing the average reasoning length by $51.60\%$ compared with fixed-depth CoT approaches.
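To make the abstract's mechanisms more concrete, here is a hedged sketch. First, the preference objective: assuming mDPO follows the standard DPO template with the visual input added to the conditioning (the paper's exact objective may include additional visual-preference terms), it would take roughly the form:

```latex
\mathcal{L}_{\text{mDPO}}
  = -\,\mathbb{E}_{(x,\, v,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x, v)}{\pi_{\text{ref}}(y_w \mid x, v)}
      - \beta \log \frac{\pi_\theta(y_l \mid x, v)}{\pi_{\text{ref}}(y_l \mid x, v)}
    \right) \right]
```

where $x$ is the question, $v$ the medical image, $y_w$ and $y_l$ the preferred and dispreferred responses, $\pi_{\text{ref}}$ a frozen reference model, and $\beta$ a temperature. Second, the inference-time pieces: the sketch below illustrates similarity-based RA-MoE routing and uncertainty-gated iterative CoT as the abstract describes them; all names, thresholds, and interfaces here are hypothetical, not the paper's actual implementation.

```python
# Minimal sketch of the two inference-time mechanisms named in the abstract:
# RA-MoE routing by image/text similarity, and iterative CoT gated by a
# meta-cognitive uncertainty estimate. All identifiers are assumptions.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Expert:
    name: str
    image_centroid: np.ndarray           # embedding centroid of the expert's domain images
    text_centroid: np.ndarray            # embedding centroid of the expert's domain texts
    generate: Callable[[str], str]       # context-augmented LVLM call (assumed interface)
    uncertainty: Callable[[str], float]  # local meta-cognitive estimator (assumed interface)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def route(img_emb: np.ndarray, txt_emb: np.ndarray, experts: List[Expert]) -> Expert:
    """RA-MoE routing: pick the expert whose domain centroids are most
    similar to the query's image and text embeddings."""
    return max(
        experts,
        key=lambda e: cosine(img_emb, e.image_centroid) + cosine(txt_emb, e.text_centroid),
    )


def adaptive_cot(expert: Expert, question: str,
                 tau: float = 0.2, max_steps: int = 6) -> str:
    """Iterative CoT: extend the reasoning chain until the expert's
    uncertainty falls below tau, instead of running to a fixed depth."""
    chain = question
    for _ in range(max_steps):
        chain = chain + "\n" + expert.generate(chain)
        if expert.uncertainty(chain) < tau:  # confident enough -> stop early
            break
    return chain
```

The gated loop is what the abstract's adaptive-reasoning claim amounts to: generation stops as soon as the local uncertainty estimate drops below a threshold, which is how a method of this shape could cut average reasoning length relative to fixed-depth CoT.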
Related papers
- CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework [29.22693846221723]
We introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination. Our CARE-Flow improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA) models.
arXiv Detail & Related papers (2026-03-02T08:38:37Z)
- ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models [24.19721015692576]
We propose ClinCoT to transform preference optimization from response-level correction to visual-driven reasoning. We show that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
arXiv Detail & Related papers (2026-03-01T14:15:54Z)
- MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization [46.65200216642429]
We introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD, featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%.
arXiv Detail & Related papers (2026-02-01T07:56:10Z)
- MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z)
- S-Chain: Structured Visual Chain-of-Thought For Medicine [81.97605645734741]
We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT). The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical vision-language models.
arXiv Detail & Related papers (2025-10-26T15:57:14Z)
- Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models [15.530083855947987]
We propose the first Multimodal Medical Reasoning-with-Retrieval framework, Med-RwR. Med-RwR actively retrieves external knowledge by querying observed symptoms or domain-specific medical concepts during reasoning. Evaluation on various public medical benchmarks demonstrates Med-RwR's significant improvements over baseline models.
arXiv Detail & Related papers (2025-10-21T05:18:18Z)
- TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models [54.48710348910535]
Existing medical reasoning benchmarks primarily focus on analyzing a patient's condition based on an image from a single visit. We introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits.
arXiv Detail & Related papers (2025-09-29T17:51:26Z)
- Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning [6.778254993886297]
We introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards framework.
arXiv Detail & Related papers (2025-09-18T13:35:14Z)
- MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering [4.285647375182588]
Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting.
arXiv Detail & Related papers (2025-08-20T05:43:26Z)
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align the MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
- GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning [50.94508930739623]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. Current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. This work first proposes a Thinking with Visual Grounding dataset wherein answer generation is decomposed into intermediate reasoning steps. We introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
- CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making [42.28216499263317]
We introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks. We propose a novel large-scale RL framework for Med-VLMs, which integrates rewards that ensure perception-reasoning fidelity, reasoning-to-answer consistency, and rule-based accuracy of final responses.
arXiv Detail & Related papers (2025-06-15T13:42:46Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
- STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.