Related papers: When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

URL: http://arxiv.org/abs/2603.01301v1
Date: Sun, 01 Mar 2026 22:16:19 GMT
Title: When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati,
Abstract summary: Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs)<n>It remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT)<n>We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL.
Score: 1.9256950761509062
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.

Related papers

CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients [2.377303603725137]
We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU.<n> Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH)<n>We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features.
arXiv Detail & Related papers (2025-12-28T21:57:42Z)
Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning [30.751908700207185]
SFT plays a crucial role across several scenarios.<n>SFT with only 2K achieves comparable or better reasoning performance to RL with 20K.<n>We identify a pervasive issue of deceptive rewards, where higher rewards fail to correlate with better reasoning accuracy in RL.
arXiv Detail & Related papers (2025-12-14T13:46:42Z)
More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models [74.10138874771852]
We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR.<n>For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency.<n>For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision.
arXiv Detail & Related papers (2025-12-13T23:06:18Z)
Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning [15.894854593567963]
Reinforcement learning can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks.<n>We build an updated vision-language model based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability.<n>We find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results.
arXiv Detail & Related papers (2025-12-11T14:36:14Z)
RL makes MLLMs see better than SFT [96.508432109136]
We conduct a critical yet under-explored analysis of the vision encoder of Multimodal Language Model (MLLM)<n>Our results demonstrate that MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes MLLM's underlying visual representations.<n>We then reframe our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT)
arXiv Detail & Related papers (2025-10-18T03:37:17Z)
RARL: Improving Medical VLM Reasoning and Generalization with Reinforcement Learning and LoRA under Data and Hardware Constraints [0.0]
Reasoning-Aware Reinforcement Learning framework enhances the reasoning capabilities of medical vision-language models.<n>Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions.<n> Experimental results show RARL significantly improves VLM performance in medical image analysis and clinical reasoning.
arXiv Detail & Related papers (2025-06-07T00:26:23Z)
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks.<n>We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions.<n>We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models [39.551767637896404]
This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs)<n>We show that SFT can significantly undermine subsequent RL by inducing pseudo reasoning paths'' imitated from expert models.<n>We introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs.
arXiv Detail & Related papers (2025-04-10T16:54:05Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning.<n>We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [26.835266813794316]
We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning.<n>We then rethink and question whether explicit thinking in RFT is always necessary.<n>No-Thinking-RL explores RFT without thinking by introducing a simple equality accuracy reward.
arXiv Detail & Related papers (2025-03-20T14:37:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.