HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
- URL: http://arxiv.org/abs/2506.00805v1
- Date: Sun, 01 Jun 2025 03:11:00 GMT
- Title: HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
- Authors: Songtao Jiang, Yan Zhang, Yeying Jin, Zhihang Tang, Yangyang Wu, Yang Feng, Jian Wu, Zuozhu Liu,
- Abstract summary: We propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment. HSCR generates high-quality preference data and captures nuanced and context-aware preferences for improved alignment.
- Score: 23.158036246184174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.
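The implicit alignment reward described in the abstract, scoring each generated token by how much its log-probability drops when visual tokens are removed, can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation; the function names and shapes are hypothetical:

```python
import torch

def alignment_reward(logits_full, logits_dropped, token_ids):
    """Per-token implicit alignment reward: the drop in log-probability of
    each realized token when visual tokens are dropped from the context.
    Large positive values mark tokens that lean heavily on the image
    (modality-coupled); near-zero values mark text-driven tokens.
    Shapes (illustrative): logits_* [seq, vocab], token_ids [seq]."""
    logp_full = logits_full.log_softmax(-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    logp_drop = logits_dropped.log_softmax(-1).gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return logp_full - logp_drop

def flag_modality_coupled(reward, top_k=3):
    """Indices of the most vision-dependent tokens: candidates for
    replacement with hallucinated alternatives when building dispreferred data."""
    return reward.topk(min(top_k, reward.numel())).indices
```

In the paper's pipeline these flagged positions would guide token replacement during decoding to produce dispreferred responses; here the logits are simply taken as given tensors.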
Related papers
- ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models [24.19721015692576]
We propose ClinCoT to transform preference optimization from response-level correction to visual-driven reasoning.
We show that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
arXiv Detail & Related papers (2026-03-01T14:15:54Z) - PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings.
PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes.
Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
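The core contrastive-decoding step that this line of work builds on can be sketched as a single next-token computation. This is a simplified sketch, not PromptCD's actual formulation; the mixing weight `alpha` and the exact combination rule are assumptions:

```python
import torch

def contrastive_next_token_logits(logits_pos, logits_neg, alpha=1.0):
    """Combine next-token logits from a positively prompted pass and a
    negatively prompted pass, amplifying what the positive prompt favors
    relative to the negative one. logits_* : [vocab] tensors."""
    logp_pos = logits_pos.log_softmax(-1)
    logp_neg = logits_neg.log_softmax(-1)
    return (1 + alpha) * logp_pos - alpha * logp_neg
```

A token that both prompts favor is down-weighted, while a token favored specifically under the positive prompt wins the greedy argmax.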
arXiv Detail & Related papers (2026-02-24T08:56:52Z) - Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm [7.449373800890174]
Reinforcement learning for large language model alignment has progressed rapidly in recent years.
However, transferring these paradigms to high-stakes medical question answering reveals a fundamental paradigm mismatch.
We propose a robust medical alignment paradigm to address these challenges.
arXiv Detail & Related papers (2026-02-12T07:26:23Z) - MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning [52.064286116035134]
We develop MedAlign, a framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA).
We first propose a multimodal Direct Preference Optimization (mDPO) objective to align preference learning with visual context.
We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM.
arXiv Detail & Related papers (2025-10-24T02:11:05Z) - Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift [12.373281238541296]
Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment.
We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise.
Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods.
arXiv Detail & Related papers (2025-09-11T12:26:57Z) - MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation [55.37355146924576]
MedSeqFT is a sequential fine-tuning framework for medical image analysis.
It adapts pre-trained models to new tasks while refining their representational capacity.
It consistently outperforms state-of-the-art fine-tuning strategies.
arXiv Detail & Related papers (2025-09-07T15:22:53Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.
We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Efficient Medical VIE via Reinforcement Learning [10.713109515157475]
Visual Information Extraction (VIE) converts unstructured document images into structured formats, which is critical for medical applications like report analysis and online consultations.
Traditional methods rely on OCR and language models, while end-to-end multimodal models offer direct generation.
We base our approach on the Reinforcement Learning with Verifiable Rewards (RLVR) framework to address these challenges using only 100 annotated samples.
arXiv Detail & Related papers (2025-06-16T11:10:25Z) - CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making [42.28216499263317]
We introduce Med-Zero-17K, a curated dataset for pure RL-based training, encompassing over 30 medical image modalities and 24 clinical tasks.
We propose a novel large-scale RL framework for Med-VLMs, which integrates rewards to ensure fidelity between perception and reasoning, consistency in reasoning-to-answer, and rule-based accuracy for final responses.
arXiv Detail & Related papers (2025-06-15T13:42:46Z) - Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
Large Vision-Language Models (LVLMs) are susceptible to hallucinations.
We introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy.
We show that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
arXiv Detail & Related papers (2025-05-26T08:36:10Z) - T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores.
Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z) - Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences.
However, RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks.
We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z) - Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning [44.99833362998488]
We propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA) for medical image analysis.
HiCA combines domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels.
We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound.
arXiv Detail & Related papers (2025-01-16T05:01:30Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts.
With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS).
Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.
High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization [25.937453082034448]
We propose MMedPO, a novel multimodal medical preference optimization approach.
MMedPO considers the clinical relevance of preference samples to enhance Med-LVLM alignment.
Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs.
arXiv Detail & Related papers (2024-12-09T01:50:39Z) - Systematic Reward Gap Optimization for Mitigating VLM Hallucinations [34.71750379630014]
We introduce Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration.
TPR provides topic-level control over fine-grained semantic details, enabling advanced data curation strategies.
It significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also exhibits superior data efficiency towards robust and cost-effective VLM alignment.
arXiv Detail & Related papers (2024-11-26T09:42:07Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - A Knowledge-based Learning Framework for Self-supervised Pre-training Towards Enhanced Recognition of Medical Images [14.304996977665212]
This study proposes a knowledge-based learning framework towards enhanced recognition of medical images.
It works in three phases by synergizing contrastive learning and generative learning models.
The proposed framework statistically excels on self-supervised benchmarks, achieving improvements of 2.08, 1.23, 1.12, 0.76, and 1.38 percentage points over SimCLR in AUC/Dice.
arXiv Detail & Related papers (2022-11-27T03:58:58Z) - Greedy based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning [64.05646120624287]
We derive the expression of the joint Q value function of LVD and MVD.
To ensure optimal consistency, the optimal node is required to be the unique STN.
Our method outperforms state-of-the-art baselines in experiments on various benchmarks.
arXiv Detail & Related papers (2022-11-22T08:14:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.