Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
- URL: http://arxiv.org/abs/2510.23217v1
- Date: Mon, 27 Oct 2025 11:08:05 GMT
- Title: Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
- Authors: Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz,
- Abstract summary: We introduce a sentence-level Reward Model (PRM) adapted for this vision-language task.<n>PRM predicts the factual correctness of each generated sentence conditioned on clinical context.<n>PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5%.
- Score: 12.808813933646407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM show relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations
Related papers
- Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation [43.67582796047454]
We discuss the impact of data quantity and quality on the performance ofReinforcement learning (RL) in medical contexts.<n>We propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples.<n>We introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimize for clinical accuracy by using a diagnostic F1 score as the reward signal.
arXiv Detail & Related papers (2026-03-04T12:57:05Z) - Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering.<n>Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition.<n>We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer [12.795639054336226]
preoperative assessment of lymph node metastasis in rectal cancer guides treatment decisions.<n>Some artificial intelligence models operate as black boxes, lacking the interpretability needed for clinical trust.<n>We introduce LRMR, an LLM-Driven Multi-node Ranking framework.
arXiv Detail & Related papers (2025-07-15T16:29:45Z) - ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate trajectory-response type of reasoning traces.<n>ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data.<n>Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z) - Structuring Radiology Reports: Challenging LLMs with Lightweight Models [5.01440254761063]
Large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment.<n>We explore lightweight encoder-decoder models (300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets.<n>Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set.
arXiv Detail & Related papers (2025-05-30T20:12:51Z) - Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation [2.821158017021184]
Look & Mark (L&M) is a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark)<n>General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG)-the highest among all models.
arXiv Detail & Related papers (2025-05-28T10:54:40Z) - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.<n>Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters [16.74673750576054]
We evaluated whether openly available large-language models (LLMs) can automate concept extraction from computed-tomography PE (CTPE) reports without sacrificing data quality.<n>LLMs offer a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can further safeguard data quality with minimal human oversight.
arXiv Detail & Related papers (2025-03-26T21:38:06Z) - Process-Supervised Reward Models for Verifying Clinical Note Generation: A Scalable Approach Guided by Domain Expertise [14.052630186550628]
Process-supervised reward models (PRMs) excel at providing step-by-step verification for large language model (LLM) outputs in domains like mathematics and coding.<n>We introduce a novel framework for training PRMs to deliver step-level reward signals for LLM-generated clinical notes.
arXiv Detail & Related papers (2024-12-17T06:24:34Z) - EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO)<n>EACO aligns MLLMs by self-generated preference data using only 5k images economically.<n>EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z) - Provable Risk-Sensitive Distributional Reinforcement Learning with
General Function Approximation [54.61816424792866]
We introduce a general framework on Risk-Sensitive Distributional Reinforcement Learning (RS-DisRL), with static Lipschitz Risk Measures (LRM) and general function approximation.
We design two innovative meta-algorithms: textttRS-DisRL-M, a model-based strategy for model-based function approximation, and textttRS-DisRL-V, a model-free approach for general value function approximation.
arXiv Detail & Related papers (2024-02-28T08:43:18Z) - Advancing Radiograph Representation Learning with Masked Record Modeling [52.04899592688968]
We formulate the self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM)
MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations.
Specifically, we find that MRM offers superior performance in label-efficient fine-tuning.
arXiv Detail & Related papers (2023-01-30T18:33:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.