Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
- URL: http://arxiv.org/abs/2506.15068v1
- Date: Wed, 18 Jun 2025 02:16:53 GMT
- Title: Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
- Authors: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
- Abstract summary: We propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO. PrefBERT offers better semantic reward feedback than the traditional metrics ROUGE-L and BERTScore do. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences.
- Score: 3.727285983486079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
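To make the setup concrete, below is a minimal sketch, not the authors' released implementation, of how a learned scorer in the spirit of PrefBERT can supply scalar semantic rewards for GRPO. The checkpoint path and the two helper functions are illustrative assumptions; only the use of a sequence-classification scorer and GRPO-style group normalization reflects the abstract above.

```python
# Illustrative sketch only: a learned semantic scorer used as a GRPO reward.
# The checkpoint name is a placeholder, not the authors' released model.
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SCORER_NAME = "path/to/prefbert-style-scorer"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(SCORER_NAME)
scorer = AutoModelForSequenceClassification.from_pretrained(
    SCORER_NAME, num_labels=1  # single regression head -> scalar quality score
).eval()

@torch.no_grad()
def semantic_rewards(reference: str, responses: List[str]) -> torch.Tensor:
    """Score each sampled response against the reference (higher = better)."""
    batch = tokenizer(
        [reference] * len(responses),
        responses,
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    return scorer(**batch).logits.squeeze(-1)  # one scalar reward per response

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

The same reward slot could instead be filled by ROUGE-L or BERTScore computed against the reference, which is how the traditional-metric baselines mentioned in the abstract differ from PrefBERT.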
Related papers
- Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
- RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z)
- Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation [77.10390725623125]
Long-form question answering (LFQA) presents unique challenges for large language models. RioRAG is a novel reinforcement learning framework that advances long-form RAG through reinforced informativeness optimization.
arXiv Detail & Related papers (2025-05-27T07:34:41Z)
- Bias Fitting to Mitigate Length Bias of Reward Model in RLHF [81.44256822500257]
Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences. We propose FiMi-RM, a framework that autonomously learns and corrects underlying bias patterns. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution.
arXiv Detail & Related papers (2025-05-19T08:29:28Z)
- REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [8.587685197004097]
REINFORCE++ is a novel approach that removes the critic model while using the normalized reward of a batch as the baseline (a minimal sketch of this idea appears after this list). It exhibits robust performance across various reward models without requiring prompt set truncation. It achieves superior generalization in both RLHF and long chain-of-thought settings compared to existing REINFORCE-based methods.
arXiv Detail & Related papers (2025-01-04T02:08:06Z)
- ReFINE: A Reward-Based Framework for Interpretable and Nuanced Evaluation of Radiology Report Generation [39.542375803362965]
ReFINE is an automatic evaluation metric designed specifically for radiology report generation (R2Gen). It scores reports according to user-specified criteria and provides detailed sub-scores, enhancing interpretability. Our experiments demonstrate ReFINE's heightened correlation with human judgments and superior performance in model selection compared to traditional metrics.
arXiv Detail & Related papers (2024-11-26T10:48:55Z)
- Post-hoc Reward Calibration: A Case Study on Length Bias [28.266675778940133]
Reward models (RMs) can develop biases by exploiting spurious correlations in their training data.
These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours.
This paper addresses the challenge of correcting such biases without additional data and training.
arXiv Detail & Related papers (2024-09-25T22:30:42Z)
- MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models [22.50450558103786]
In a real-world RAG system, the current query often involves spoken ellipses and ambiguous references from dialogue contexts. We propose a novel query rewriting method MaFeRw, which improves RAG performance by integrating multi-aspect feedback from both the retrieval process and generated results. Experimental results on two conversational RAG datasets demonstrate that MaFeRw achieves superior generation metrics and more stable training compared to baselines.
arXiv Detail & Related papers (2024-08-30T07:57:30Z)
- ODIN: Disentangled Reward Mitigates Hacking in RLHF [127.35607931337019]
We study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback.
A well-formatted, verbose but less helpful response from an LLM can often deceive LLM or even human evaluators into assigning high scores.
Our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
arXiv Detail & Related papers (2024-02-11T22:40:12Z)
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2-based models, on 133M pairs of human feedback data. Our ranker outperforms the conventional dialog perplexity baseline by a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
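As a companion to the REINFORCE++ entry above, the following is a minimal sketch, under assumptions of mine rather than that paper's exact recipe, of a critic-free policy-gradient step in which the batch-normalized reward plays the role of the baseline.

```python
# Hypothetical sketch of a critic-free policy-gradient update: the reward
# normalized over the batch acts as the baseline-corrected advantage,
# so no learned value model is required.
import torch

def critic_free_pg_loss(
    logprobs: torch.Tensor,  # (batch,) summed token log-probs of each sampled response
    rewards: torch.Tensor,   # (batch,) scalar reward for each response
    eps: float = 1e-6,
) -> torch.Tensor:
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # REINFORCE objective: maximize advantage-weighted log-likelihood.
    return -(advantages.detach() * logprobs).mean()
```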