Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards
- URL: http://arxiv.org/abs/2510.02338v1
- Date: Fri, 26 Sep 2025 17:53:08 GMT
- Title: Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards
- Authors: Samyak Jhaveri, Praphul Singh, Jangwon Kim, Tara Taghavi, Krishnaram Kenthapadi
- Abstract summary: We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
- Score: 9.525090594500577
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model is already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
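As a rough illustration of the reward design sketched in the abstract, the following is a minimal, hypothetical Python sketch of a claim-level reward combined with GRPO-style group-relative advantages and a simple reward gate. The helpers `extract_claims` and `is_grounded` stand in for a DocLens-style claim evaluator, and the gating threshold is an assumed value, not the authors' implementation.

```python
# Hypothetical sketch of a claim-based reward and reward gating as described above.
# extract_claims / is_grounded stand in for a DocLens-style claim evaluator; the
# gating threshold is an illustrative assumption, not the authors' setting.
from typing import Callable, List
import statistics


def claim_reward(note: str,
                 dialogue: str,
                 extract_claims: Callable[[str], List[str]],
                 is_grounded: Callable[[str, str], bool]) -> float:
    """Fraction of the note's claims that are grounded in the source dialogue
    (a precision-style term; completeness would add a recall-style term)."""
    claims = extract_claims(note)
    if not claims:
        return 0.0
    grounded = sum(is_grounded(claim, dialogue) for claim in claims)
    return grounded / len(claims)


def group_relative_advantages(rewards: List[float],
                              gate_threshold: float = 1e-3) -> List[float]:
    """GRPO-style advantages: standardize rewards within one group of sampled notes.
    The gate zeroes out groups whose rewards are (near-)identical and carry no signal."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards)
    if std_r < gate_threshold:  # reward gate: no useful learning signal in this group
        return [0.0] * len(rewards)
    return [(r - mean_r) / std_r for r in rewards]
```

In such a setup, a group of candidate notes would be sampled per dialogue, each scored with `claim_reward`, and the standardized advantages would weight the policy-gradient update; groups gated to zero can be skipped, which is one way a reward gate could reduce training cost.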
Related papers
- ClinCoT: Clinical-Aware Visual Chain-of-Thought for Medical Vision Language Models [24.19721015692576]
We propose ClinCoT to transform preference optimization from response-level correction to visual-driven reasoning. We show that ClinCoT consistently improves factual grounding and achieves superior performance compared with existing preference-based alignment methods.
arXiv Detail & Related papers (2026-03-01T14:15:54Z)
- CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria [48.70940362676624]
We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
arXiv Detail & Related papers (2026-01-28T07:46:13Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization [28.610758740650407]
We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation.
arXiv Detail & Related papers (2025-11-29T19:09:24Z)
- Random Direct Preference Optimization for Radiography Report Generation [3.5915338392912344]
Radiography Report Generation (RRG) has gained significant attention in medical image analysis. Existing methods have yet to achieve the quality required for deployment in real-world clinical settings. We introduce a model-agnostic framework to enhance RRG accuracy using Direct Preference Optimization (DPO).
arXiv Detail & Related papers (2025-09-19T10:53:45Z)
- RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning [5.6813794530075725]
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement.
arXiv Detail & Related papers (2025-08-31T19:38:25Z)
- Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z)
- In-context Ranking Preference Optimization [65.5489745857577]
We propose an In-context Ranking Preference Optimization (IRPO) framework to optimize large language models (LLMs) based on ranking lists constructed during inference. We show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
arXiv Detail & Related papers (2025-04-21T23:06:12Z)
- AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset [89.37514696019484]
Preference learning is critical for aligning large language models with human values. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization.
arXiv Detail & Related papers (2025-04-04T17:33:07Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative (a minimal sketch of the standard DPO objective appears after this list).
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- Systematic Reward Gap Optimization for Mitigating VLM Hallucinations [34.71750379630014]
We introduce Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration. TPR provides topic-level control over fine-grained semantic details, enabling advanced data curation strategies. It significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also exhibits superior data efficiency towards robust and cost-effective VLM alignment.
arXiv Detail & Related papers (2024-11-26T09:42:07Z)
- Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models [54.381650481255235]
We introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO).
Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions.
Empirical evaluations on eight recent LLMs, both open and closed-sourced, demonstrate that DRPO significantly enhances alignment performance.
arXiv Detail & Related papers (2024-11-13T16:15:38Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
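Several of the listed works (the RRG paper, IRPO, and the DPO survey) build on Direct Preference Optimization. The following is a minimal sketch of the standard DPO objective under common assumptions; it is not any listed paper's specific variant, and the beta value is illustrative.

```python
# Minimal sketch of the standard DPO objective that several listed works build on;
# the beta value is illustrative, and the log-probabilities are assumed to be
# summed over response tokens under the policy and a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the preferred response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the reference model stays frozen while only the policy is updated, so the loss rewards the policy for widening the implicit-reward margin between preferred and rejected responses.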
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.