Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- URL: http://arxiv.org/abs/2407.19594v2
- Date: Tue, 30 Jul 2024 01:38:06 GMT
- Title: Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Authors: Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
- Abstract summary: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains.
Recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers.
We introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills.
- Score: 77.9094410773789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
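The meta-judging idea in the abstract (the model compares its own judgments and turns the outcome into training signal for judging) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_judgment_pair`, the `meta_judge` callable, the pairwise win-count ranking, and the length-based toy meta-judge are all assumptions made purely for illustration. In practice the meta-judge would be a prompt to the same LLM.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple

# Hypothetical meta-judge signature: given a prompt, a response, and two
# candidate judgments of that response, return 0 if the first judgment is
# better and 1 if the second is better.
MetaJudge = Callable[[str, str, str, str], int]


def build_judgment_pair(
    prompt: str,
    response: str,
    judgments: List[str],
    meta_judge: MetaJudge,
) -> Optional[Tuple[str, str]]:
    """Return a (chosen, rejected) pair of judgments, or None if there is no signal.

    Every pair of judgments is compared once by the meta-judge; the judgment
    with the most wins becomes "chosen" and the one with the fewest "rejected".
    Ranking by plain win counts is an illustrative assumption.
    """
    wins = [0] * len(judgments)
    for i, j in combinations(range(len(judgments)), 2):
        winner = meta_judge(prompt, response, judgments[i], judgments[j])
        wins[i if winner == 0 else j] += 1

    if not judgments:
        return None
    best = max(range(len(judgments)), key=wins.__getitem__)
    worst = min(range(len(judgments)), key=wins.__getitem__)
    if wins[best] == wins[worst]:
        return None  # all judgments tied: no usable preference
    return judgments[best], judgments[worst]


def toy_meta_judge(prompt: str, response: str, a: str, b: str) -> int:
    """Placeholder meta-judge that prefers the longer judgment (assumption)."""
    return 0 if len(a) >= len(b) else 1


if __name__ == "__main__":
    candidate_judgments = [
        "Score: 4",
        "Score: 4 because the answer is complete and cites its source.",
        "Score: 2, too short.",
    ]
    pair = build_judgment_pair(
        "What is DPO?", "DPO is ...", candidate_judgments, toy_meta_judge
    )
    print(pair)
```

The resulting (chosen, rejected) pairs could then be fed to a preference-optimization step, so that judging ability is trained alongside response quality.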
Related papers
- Self-Improving VLM Judges Without Human Annotations [74.29324865147838]
We present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data.
Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on Multimodal RewardBench.
The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
arXiv Detail & Related papers (2025-12-02T20:52:19Z) - Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future [38.1810626252963]
Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting.
We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals.
arXiv Detail & Related papers (2025-08-08T05:25:54Z) - Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning.
We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better quality feedback for the next RL iteration.
Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
arXiv Detail & Related papers (2025-05-27T17:16:00Z) - Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization [35.807318314766974]
EVOLVE is a novel framework that integrates preference training and self-refinement data collection.
It consistently enhances performance on mathematical reasoning tasks like GSM8K and MATH.
arXiv Detail & Related papers (2025-02-08T15:21:55Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
arXiv Detail & Related papers (2024-09-02T04:14:13Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Teaching Language Models to Self-Improve by Learning from Language Feedback [40.649677201161744]
We present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment.
SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model.
SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement.
arXiv Detail & Related papers (2024-06-11T11:20:05Z) - Self-Rewarding Language Models [105.6830788170348]
We study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction following ability improve, but so does the model's ability to provide high-quality rewards to itself.
arXiv Detail & Related papers (2024-01-18T14:43:47Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics [5.516095889257118]
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination.
This method involves refining model outputs through an ensemble of critics and the model's own feedback.
arXiv Detail & Related papers (2023-10-28T11:22:22Z) - SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.