Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
- URL: http://arxiv.org/abs/2512.23090v2
- Date: Fri, 02 Jan 2026 18:25:09 GMT
- Title: Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
- Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
- Abstract summary: We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features.
- Score: 2.377303603725137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
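The R1-style recipe described in the abstract pairs supervised fine-tuning with GRPO. The sketch below illustrates only the generic group-relative advantage step of GRPO-style optimization; it is a minimal illustration, not the paper's implementation, and the reward values, the idea of scoring sampled chest X-ray reports with a rule-based finding reward, and all names are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled completion's reward
    against the mean and std of its own group, so no learned critic is
    needed -- only relative quality within the group matters."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for 4 reports sampled for one chest X-ray prompt,
# e.g. a rule-based score over predicted findings (values are illustrative).
group_rewards = [0.8, 0.5, 0.5, 0.2]
print(group_relative_advantages(group_rewards))
# Above-average completions receive positive advantages, below-average negative.
```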
Related papers
- Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation [43.67582796047454]
We discuss the impact of data quantity and quality on the performance of Reinforcement Learning (RL) in medical contexts. We propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. We introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal.
arXiv Detail & Related papers (2026-03-04T12:57:05Z) - When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains [1.9256950761509062]
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs). It remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL.
arXiv Detail & Related papers (2026-03-01T22:16:19Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches [5.958100741754613]
We evaluated large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas. We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79.
arXiv Detail & Related papers (2025-12-05T08:49:57Z) - Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data? [82.09573568241724]
EssenceBench is a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA). Our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. On the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using 200x fewer samples.
arXiv Detail & Related papers (2025-10-12T05:38:10Z) - BroRL: Scaling Reinforcement Learning via Broadened Exploration [88.69554867685243]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. We investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds.
arXiv Detail & Related papers (2025-10-01T17:59:02Z) - Predicting Diabetic Retinopathy Using a Two-Level Ensemble Model [0.6445605125467574]
Diabetic retinopathy is a leading cause of blindness in working-age adults. Image-based AI tools have shown limitations in early-stage detection. We propose a non-image-based, two-level ensemble model for DR prediction using routine laboratory test results.
arXiv Detail & Related papers (2025-10-01T16:19:57Z) - Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation [2.821158017021184]
Look & Mark (L&M) is a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark). General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models.
arXiv Detail & Related papers (2025-05-28T10:54:40Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model [47.108822717757945]
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model. We demonstrate that PPO with GAE and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length (a generic GAE sketch follows this list).
arXiv Detail & Related papers (2025-03-31T16:36:05Z) - Understanding R1-Zero-Like Training: A Critical Perspective [73.25430192337235]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z) - Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing [39.36305648162564]
This paper proposes ultra-small artificial neural models called coefficients learning (CL). CL enables training-free and rapid sparse reconstruction while inheriting the generality and interpretability of traditional iterative methods. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data.
arXiv Detail & Related papers (2025-01-20T16:50:59Z)
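As referenced in the Open-Reasoner-Zero entry above, that line of work relies on PPO with Generalized Advantage Estimation (GAE) and rule-based rewards. The following is a minimal, generic GAE sketch, not code from any of the listed papers; the hyperparameters (gamma, lam) and the toy episode values are assumptions for illustration.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one non-terminated segment:
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    values = np.append(np.asarray(values, dtype=float), last_value)
    rewards = np.asarray(rewards, dtype=float)
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy episode: a sparse rule-based reward arrives only at the final step.
print(gae(rewards=[0.0, 0.0, 1.0], values=[0.1, 0.2, 0.5], last_value=0.0))
```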
This list is automatically generated from the titles and abstracts of the papers on this site.