Related papers: Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports

URL: http://arxiv.org/abs/2510.23217v1
Date: Mon, 27 Oct 2025 11:08:05 GMT
Title: Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
Authors: Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz
Abstract summary: We introduce a sentence-level Process Reward Model (PRM) adapted for this vision-language task. The PRM predicts the factual correctness of each generated sentence, conditioned on clinical context. PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5%.
Score: 12.808813933646407
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations.
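
The two downstream uses named in the abstract (discarding the lowest-scoring reports, and picking one report out of N candidates) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how sentence-level scores are aggregated into a report score, nor how the "weighted" best-of-N variant is weighted, so this sketch uses a plain mean and an unweighted argmax. The function names and the `scores` field are hypothetical and do not come from the paper.

```python
from statistics import mean


def report_score(sentence_scores):
    """Collapse per-sentence correctness probabilities into one report score.

    Assumption: a plain mean. The abstract does not state the aggregation.
    """
    return mean(sentence_scores)


def filter_worst(reports, fraction=0.10):
    """Return the reports that remain after dropping the lowest-scoring fraction."""
    ranked = sorted(reports, key=lambda r: report_score(r["scores"]))
    n_drop = int(len(ranked) * fraction)  # number of reports to discard
    return ranked[n_drop:]


def best_of_n(candidates):
    """Pick the candidate with the highest aggregated score.

    Assumption: unweighted selection, not the paper's weighted variant.
    """
    return max(candidates, key=lambda r: report_score(r["scores"]))


# Hypothetical data: ten reports, each with two sentence-level scores.
reports = [{"id": i, "scores": [i / 10, 1.0]} for i in range(10)]
kept = filter_worst(reports, fraction=0.10)
best = best_of_n(reports)
print(len(kept), best["id"])
```

A stricter aggregation, such as the minimum sentence score, would penalize a report for a single hallucinated sentence; which choice matches the paper cannot be determined from the abstract alone.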

Related papers

Rethinking the Efficiency and Effectiveness of Reinforcement Learning for Radiology Report Generation [43.67582796047454]
We discuss the impact of data quantity and quality on the performance of reinforcement learning (RL) in medical contexts. We propose a diagnostic diversity-based data sampling strategy that enables comparable performance with fewer samples. We introduce Diagnostic Token-weighted Policy Optimization (DiTPO), which directly optimizes for clinical accuracy by using a diagnostic F1 score as the reward signal.
arXiv Detail & Related papers (2026-03-04T12:57:05Z)
Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z)
LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer [12.795639054336226]
Preoperative assessment of lymph node metastasis in rectal cancer guides treatment decisions. Some artificial intelligence models operate as black boxes, lacking the interpretability needed for clinical trust. We introduce LRMR, an LLM-Driven Relational Multi-node Ranking framework.
arXiv Detail & Related papers (2025-07-15T16:29:45Z)
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
Structuring Radiology Reports: Challenging LLMs with Lightweight Models [5.01440254761063]
Although large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. We explore lightweight encoder-decoder models (300M parameters), specifically T5 and BERT2BERT, for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set.
arXiv Detail & Related papers (2025-05-30T20:12:51Z)
Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation [2.821158017021184]
Look & Mark (L&M) is a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark). General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG), the highest among all models.
arXiv Detail & Related papers (2025-05-28T10:54:40Z)
ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports. Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters [16.74673750576054]
We evaluated whether openly available large language models (LLMs) can automate concept extraction from computed-tomography PE (CTPE) reports without sacrificing data quality. LLMs offer a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can further safeguard data quality with minimal human oversight.
arXiv Detail & Related papers (2025-03-26T21:38:06Z)
Process-Supervised Reward Models for Verifying Clinical Note Generation: A Scalable Approach Guided by Domain Expertise [14.052630186550628]
Process-supervised reward models (PRMs) excel at providing step-by-step verification for large language model (LLM) outputs in domains like mathematics and coding. We introduce a novel framework for training PRMs to deliver step-level reward signals for LLM-generated clinical notes.
arXiv Detail & Related papers (2024-12-17T06:24:34Z)
EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO). EACO aligns MLLMs economically with self-generated preference data, using only 5k images. EACO reduces overall hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z)
Provable Risk-Sensitive Distributional Reinforcement Learning with General Function Approximation [54.61816424792866]
We introduce a general framework on Risk-Sensitive Distributional Reinforcement Learning (RS-DisRL), with static Lipschitz Risk Measures (LRM) and general function approximation. We design two innovative meta-algorithms: RS-DisRL-M, a model-based strategy for model-based function approximation, and RS-DisRL-V, a model-free approach for general value function approximation.
arXiv Detail & Related papers (2024-02-28T08:43:18Z)
Advancing Radiograph Representation Learning with Masked Record Modeling [52.04899592688968]
We formulate the self- and report-completion as two complementary objectives and present a unified framework based on masked record modeling (MRM). MRM reconstructs masked image patches and masked report tokens following a multi-task scheme to learn knowledge-enhanced semantic representations. Specifically, we find that MRM offers superior performance in label-efficient fine-tuning.
arXiv Detail & Related papers (2023-01-30T18:33:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.