Related papers: Self-Generated Critiques Boost Reward Modeling for Language Models

Self-Generated Critiques Boost Reward Modeling for Language Models

URL: http://arxiv.org/abs/2411.16646v1
Date: Mon, 25 Nov 2024 18:28:26 GMT
Title: Self-Generated Critiques Boost Reward Modeling for Language Models
Authors: Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou,
Abstract summary: Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
Score: 57.60881438647227
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reward modeling is crucial for aligning large language models (LLMs) with human preferences, especially in reinforcement learning from human feedback (RLHF). However, current reward models mainly produce scalar scores and struggle to incorporate critiques in a natural language format. We hypothesize that predicting both critiques and the scalar reward would improve reward modeling ability. Motivated by this, we propose Critic-RM, a framework that improves reward models using self-generated critiques without extra supervision. Critic-RM employs a two-stage process: generating and filtering high-quality critiques, followed by joint fine-tuning on reward prediction and critique generation. Experiments across benchmarks show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges, demonstrating strong performance and data efficiency. Additional studies further validate the effectiveness of generated critiques in rectifying flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.

Related papers

Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.60378227969643]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision.<n>Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly.<n>Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
arXiv Detail & Related papers (2025-10-28T11:37:01Z)
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges as a top-performing critic but also as a competitive policy model.<n>Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks.<n>Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards.<n>We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
Training Language Model to Critique for Better Refinement [58.73039433159486]
We introduce textbfRefinement-oriented textbfCritique textbfOptimization (RCO), a novel framework designed to train critic models using refinement signals.<n>RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses.<n>By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment.
arXiv Detail & Related papers (2025-06-27T12:10:57Z)
ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models [9.30148520355391]
We present ReasonGRM, a three-stage generative reward modeling framework.<n>In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths.<n>In the second stage, $Rstar$, which scores reasoning paths based on their generation likelihood.<n>In the final stage, the model is further refined through reinforcement learning to enhance its preference discrimination capabilities.
arXiv Detail & Related papers (2025-06-20T03:10:52Z)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization.<n>We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization [15.729285736811383]
Reward models play a crucial role in reinforcement learning from human feedback.<n>Existing benchmarks for reward models show a weak correlation with the performance of optimized policies.
arXiv Detail & Related papers (2025-05-19T06:43:08Z)
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.645427839457]
Self-Play Critic (SPC) is a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" and a "critic"
arXiv Detail & Related papers (2025-04-27T08:45:06Z)
Teaching Language Models to Critique via Reinforcement Learning [59.36253627145115]
We show that critics trained with $textttCTRL$ significantly enhance pass rates and mitigate errors across both base and stronger generator models.<n>We also show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision.
arXiv Detail & Related papers (2025-02-05T02:18:46Z)
Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities. It self-improves by training on synthetic data, generated by a contrastive-based self-critic. It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking. We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment [50.21842377409232]
Despite vital role reward models play in alignment, previous works have consistently overlooked their performance. This work first investigates the quality of the widely-used preference dataset, HH-RLHF, and curates a clean version, CHH-RLHF. Based on CHH-RLHF, we benchmark the accuracy of a broad range of reward models used in previous alignment works, unveiling the unreliability of using them both for optimization and evaluation.
arXiv Detail & Related papers (2024-09-26T04:28:35Z)
Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs. We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z)
Critique-out-Loud Reward Models [20.631830494414096]
We introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant's response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models.
arXiv Detail & Related papers (2024-08-21T17:24:15Z)
Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only. Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
Improving Reward Models with Synthetic Critiques [20.180933963110814]
Reward models (RMs) play a critical role in aligning language models through the process of reinforcement learning from human feedback. We propose a novel approach using synthetic natural language critiques generated by large language models to provide additional feedback. We demonstrate that high-quality critiques improve the performance and data efficiency of RMs from different pretrained models.
arXiv Detail & Related papers (2024-05-31T14:33:07Z)
Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset. We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems [10.58737969057445]
We build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by large language models. We propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text. Experimental results demonstrate the efficacy of our approach, including a 4% precision increase in citation and an approximately 8% enhancement in the MAUVE metric for fluency.
arXiv Detail & Related papers (2023-09-08T09:39:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.