Self-rationalization improves LLM as a fine-grained judge
- URL: http://arxiv.org/abs/2410.05495v1
- Date: Mon, 7 Oct 2024 21:05:53 GMT
- Title: Self-rationalization improves LLM as a fine-grained judge
- Authors: Prapti Trivedi, Aditya Gulati, Oliver Molenschot, Meghana Arakkal Rajeev, Rajkumar Ramamurthy, Keith Stevens, Tanveesh Singh Chaudhery, Jahnavi Jambholkar, James Zou, Nazneen Rajani
- Abstract summary: We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models.
Self-rationalization works by having the model generate multiple judgments with rationales for the same input.
We show that our model learns to produce higher-quality rationales, with a win rate of $62\%$ on average compared to models trained via SFT on rationales alone.
- Score: 21.917301609125417
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LLM-as-a-judge models have been used for evaluating both human- and AI-generated content, specifically by providing scores and rationales. Rationales, in addition to increasing transparency, help models learn to calibrate their judgments. Enhancing a model's rationales can therefore improve its calibration abilities and, ultimately, its ability to score content. We introduce Self-Rationalization, an iterative process for improving the rationales of judge models, which consequently improves scoring for fine-grained, customizable criteria (i.e., Likert-scale scoring with arbitrary evaluation criteria). Self-rationalization works by having the model generate multiple judgments with rationales for the same input, curating a preference-pair dataset from its own judgments, and iteratively fine-tuning the judge via DPO. Intuitively, this approach allows the judge model to self-improve by learning from its own rationales, leading to better alignment and evaluation accuracy. After just two iterations -- while relying only on examples in the training set -- human evaluation shows that our judge model learns to produce higher-quality rationales, with a win rate of $62\%$ on average compared to models trained via SFT on rationales alone. This judge model also achieves high scoring accuracy on BigGen Bench and Reward Bench, outperforming even larger models trained using SFT with rationales, self-consistency, or best-of-$N$ sampling by $3\%$ to $9\%$.
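The loop described in the abstract (sample several judgments with rationales for the same input, curate preference pairs from the model's own outputs, fine-tune via DPO, repeat) can be sketched as follows. This is a minimal illustration, assuming pairs are formed by comparing each sampled score against the training set's reference score; `sample_judgment` and the example fields are hypothetical stand-ins, not the authors' code.

```python
# Minimal sketch of one self-rationalization iteration. Assumption: a sampled
# judgment whose score matches the training example's reference score is
# "chosen", and a mismatching one is "rejected". `sample_judgment` and the
# example fields are hypothetical stand-ins for the judge LLM and data schema.
import random
from dataclasses import dataclass

@dataclass
class Judgment:
    rationale: str
    score: int  # Likert-scale score under the given evaluation criteria

def sample_judgment(judge_llm, prompt: str, criteria: str) -> Judgment:
    """Hypothetical: sample one judgment-with-rationale from the judge LLM."""
    raise NotImplementedError

def curate_preference_pairs(judge_llm, train_set, criteria: str, n_samples: int = 8):
    """Build (chosen, rejected) rationale pairs from the judge's own outputs."""
    pairs = []
    for example in train_set:  # assumed fields: .prompt, .reference_score
        judgments = [sample_judgment(judge_llm, example.prompt, criteria)
                     for _ in range(n_samples)]
        agree = [j for j in judgments if j.score == example.reference_score]
        disagree = [j for j in judgments if j.score != example.reference_score]
        if agree and disagree:
            pairs.append({
                "prompt": example.prompt,
                "chosen": random.choice(agree).rationale,
                "rejected": random.choice(disagree).rationale,
            })
    return pairs

# The curated pairs would then drive a standard DPO fine-tuning step on the
# judge, and the whole loop would be repeated for the next iteration.
```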
Related papers
- Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data [14.95829896035971]
An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments.
Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half.
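As a toy illustration of stretching a few ground-truth labels across many judge scores (a generic control-variate style correction, not this paper's estimator or its lower bound), consider:

```python
# Toy control-variate style correction: many cheap judge scores, few gold labels.
# Generic illustration only; not this paper's estimator or its lower bound.
import numpy as np

def debiased_mean(judge_all: np.ndarray,
                  judge_gold: np.ndarray,
                  gold: np.ndarray) -> float:
    """Mean quality estimate from all judge scores, bias-corrected on the gold subset.

    judge_all:  judge scores on the full evaluation set
    judge_gold: judge scores on the small gold-labeled subset
    gold:       human ground-truth scores on that same subset
    """
    bias = judge_gold.mean() - gold.mean()  # judge's measured offset
    return float(judge_all.mean() - bias)

# Synthetic check: a judge that over-scores by ~0.3 on a 5,000-item set,
# corrected using only 100 gold labels.
rng = np.random.default_rng(0)
truth = rng.normal(3.0, 1.0, size=5000)
judge = truth + 0.3 + rng.normal(0.0, 0.5, size=5000)
idx = rng.choice(5000, size=100, replace=False)
print(debiased_mean(judge, judge[idx], truth[idx]))  # close to truth.mean()
```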
arXiv Detail & Related papers (2024-10-17T08:49:42Z) - Direct Judgement Preference Optimization [66.83088028268318]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs.
We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective.
Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study the task of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
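A minimal sketch of the selective-execution behavior described above, assuming a self-assessed quality score and a fixed threshold; `draft_response` and `self_quality_score` are hypothetical hooks, not Self-J's actual interface:

```python
# Sketch of selective instruction following: answer only when the model's own
# quality estimate clears a threshold. Both helper functions are hypothetical
# hooks around the underlying LLM, and the 0.7 threshold is an assumption.
def draft_response(llm, instruction: str) -> str:
    """Hypothetical: generate a candidate response to the instruction."""
    raise NotImplementedError

def self_quality_score(llm, instruction: str, response: str) -> float:
    """Hypothetical: the model's own 0-1 estimate of response quality."""
    raise NotImplementedError

def answer_or_decline(llm, instruction: str, threshold: float = 0.7) -> str:
    candidate = draft_response(llm, instruction)
    if self_quality_score(llm, instruction, candidate) < threshold:
        return "Declined: anticipated response quality is too low."
    return candidate
```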
arXiv Detail & Related papers (2024-09-02T04:14:13Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Improving Language Model Reasoning with Self-motivated Learning [60.779625789039486]
The Self-motivated Learning framework motivates the model itself to automatically generate rationales on existing datasets.
We train a reward model on the resulting rankings to evaluate the quality of rationales, and improve reasoning performance through reinforcement learning.
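A generic pairwise (Bradley-Terry style) ranking loss for such a reward model might look like the sketch below; the 768-dimensional pooled features and linear reward head are toy assumptions rather than the paper's architecture:

```python
# Generic pairwise (Bradley-Terry style) ranking loss for a reward model over
# rationales: the higher-ranked rationale should receive the larger reward.
# The 768-dim pooled features and linear reward head are toy assumptions.
import torch
import torch.nn.functional as F

reward_head = torch.nn.Linear(768, 1)

def ranking_loss(feat_better: torch.Tensor, feat_worse: torch.Tensor) -> torch.Tensor:
    r_better = reward_head(feat_better).squeeze(-1)
    r_worse = reward_head(feat_worse).squeeze(-1)
    return -F.logsigmoid(r_better - r_worse).mean()

# Toy usage with random features standing in for encoded rationale pairs.
loss = ranking_loss(torch.randn(4, 768), torch.randn(4, 768))
loss.backward()
print(float(loss))
```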
arXiv Detail & Related papers (2024-04-10T14:05:44Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
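One common way to cast self-evaluation as token-level prediction is to ask the model whether its own answer is correct and read off the probability of the "Yes" token; the sketch below assumes a hypothetical `next_token_logprobs` hook and is not necessarily the paper's exact formulation:

```python
# Self-evaluation as token-level prediction: ask the model whether its own
# answer is correct and normalize the probability mass on "Yes" vs. "No".
# `next_token_logprobs` is a hypothetical hook returning {token: logprob}.
import math

def next_token_logprobs(llm, prompt: str) -> dict:
    """Hypothetical: log-probabilities of candidate next tokens."""
    raise NotImplementedError

def self_eval_score(llm, question: str, answer: str) -> float:
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {answer}\n"
              "Is the proposed answer correct? Answer Yes or No.\nAnswer:")
    logprobs = next_token_logprobs(llm, prompt)
    p_yes = math.exp(logprobs.get(" Yes", float("-inf")))
    p_no = math.exp(logprobs.get(" No", float("-inf")))
    total = p_yes + p_no
    return p_yes / total if total > 0 else 0.5  # confidence in the answer
```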
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Leveraging Uncertainty Estimates To Improve Classifier Performance [4.4951754159063295]
Binary classification involves predicting the label of an instance based on whether the model score for the positive class exceeds a threshold chosen based on the application requirements.
However, model scores are often not aligned with the true positivity rate.
This is especially true when the training involves a differential sampling across classes or there is distributional drift between train and test settings.
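A generic remedy, not necessarily this paper's approach, is to choose the decision threshold on held-out labeled data so that a target precision is met, rather than treating the raw scores as calibrated probabilities:

```python
# Generic data-driven threshold selection: on a held-out labeled set, choose the
# lowest score cutoff whose empirical precision still meets the application's
# requirement, instead of treating raw scores as calibrated probabilities.
import numpy as np

def pick_threshold(scores: np.ndarray, labels: np.ndarray,
                   min_precision: float = 0.9) -> float:
    order = np.argsort(-scores)                      # sort by descending score
    tp = np.cumsum(labels[order])                    # true positives at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    ok = np.where(precision >= min_precision)[0]
    if len(ok) == 0:
        return float("inf")                          # no cutoff meets the target
    k = ok[-1]                                       # deepest cutoff that still qualifies
    return float(scores[order][k])

# Synthetic example where scores are informative but not calibrated.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
scores = 0.3 * labels + rng.uniform(0.0, 0.8, size=1000)
print(pick_threshold(scores, labels, min_precision=0.8))
```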
arXiv Detail & Related papers (2023-11-20T12:40:25Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
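As a toy illustration of gradient-free calibration against human labels (not AutoCalibrate's actual multi-stage procedure), one can score a small labeled set under several candidate scoring instructions and keep the one that correlates best with the human scores:

```python
# Toy gradient-free "calibration": among candidate scoring instructions, keep
# the one whose LLM scores correlate best with human labels on a small dev set.
# Illustration of the general idea only; not AutoCalibrate's actual procedure.
from scipy.stats import spearmanr

def llm_score(llm, instruction: str, sample: str) -> float:
    """Hypothetical: numeric score the LLM assigns under this instruction."""
    raise NotImplementedError

def select_instruction(llm, candidate_instructions, dev_samples, human_scores):
    best, best_rho = None, float("-inf")
    for instruction in candidate_instructions:
        scores = [llm_score(llm, instruction, s) for s in dev_samples]
        rho, _ = spearmanr(scores, human_scores)
        if rho > best_rho:
            best, best_rho = instruction, rho
    # The selected instruction is then used for all subsequent evaluations.
    return best, best_rho
```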
arXiv Detail & Related papers (2023-09-23T08:46:11Z)