Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- URL: http://arxiv.org/abs/2407.19594v2
- Date: Tue, 30 Jul 2024 01:38:06 GMT
- Title: Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Authors: Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
- Abstract summary: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains.
Recent self-rewarding mechanisms have shown that LLMs can improve by judging their own responses instead of relying on human labelers.
We introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills.
- Score: 77.9094410773789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge *and* follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.
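As a concrete illustration of the training loop the abstract describes, here is a minimal Python sketch of one Meta-Rewarding iteration: the same model is prompted as actor, judge, and meta-judge, and preference pairs are built for both response quality and judgment quality. The helpers `generate`, `judge_response`, and `meta_judge` are hypothetical stand-ins for LLM calls (stubbed with random outputs so the sketch runs as-is), not the authors' implementation, which adds further details such as length control and judge-score weighting.

```python
import random

# Hypothetical stand-ins for LLM calls; the actual method prompts the same
# instruction-tuned model (e.g., Llama-3-8B-Instruct) in all three roles.
def generate(prompt: str) -> str:
    return f"candidate answer #{random.randint(0, 9999)} to: {prompt}"

def judge_response(prompt: str, response: str) -> tuple[float, str]:
    """LLM-as-a-Judge: score a response and return the judgement text."""
    score = round(random.uniform(0.0, 5.0), 1)
    return score, f"[score {score}] judgement of: {response[:40]}"

def meta_judge(prompt: str, judgement_a: str, judgement_b: str) -> str:
    """LLM-as-a-Meta-Judge: decide which of two judgements is better."""
    return random.choice(["A", "B"])

def meta_rewarding_iteration(prompts, n_samples=4):
    actor_pairs, judge_pairs = [], []
    for p in prompts:
        responses = [generate(p) for _ in range(n_samples)]
        # Judge every sampled response, then rank by judge score.
        scored = sorted(((judge_response(p, r), r) for r in responses),
                        key=lambda item: item[0][0], reverse=True)
        # Actor preference pair: judge's best response vs. its worst.
        actor_pairs.append({"prompt": p,
                            "chosen": scored[0][1],
                            "rejected": scored[-1][1]})
        # Judge preference pair: two judgements of the same response, ranked
        # by the meta-judge, so judging skill is trained as well.
        top_response = scored[0][1]
        _, jud_a = judge_response(p, top_response)
        _, jud_b = judge_response(p, top_response)
        winner = meta_judge(p, jud_a, jud_b)
        judge_pairs.append({"prompt": p,
                            "chosen": jud_a if winner == "A" else jud_b,
                            "rejected": jud_b if winner == "A" else jud_a})
    # Both pair sets then drive a preference-optimization update of the model.
    return actor_pairs, judge_pairs

if __name__ == "__main__":
    actor, judge = meta_rewarding_iteration(["Explain DPO in one sentence."])
    print(len(actor), "actor pairs,", len(judge), "judge pairs")
```

The key difference from plain self-rewarding is the second pair set: by also training on meta-judged judgement pairs, the model's judging ability is improved alongside its responses, which is what the abstract credits for avoiding early saturation.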
Related papers
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
arXiv Detail & Related papers (2024-09-02T04:14:13Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - Teaching Language Models to Self-Improve by Learning from Language Feedback [40.649677201161744]
We present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment.
SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model.
SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement.
arXiv Detail & Related papers (2024-06-11T11:20:05Z) - Self-Rewarding Language Models [105.6830788170348]
We study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training.
We show that during Iterative DPO training, not only does instruction-following ability improve, but also the model's ability to provide high-quality rewards to itself (a minimal sketch of the DPO loss appears after this list).
arXiv Detail & Related papers (2024-01-18T14:43:47Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics [5.516095889257118]
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination.
This method involves refining model outputs through an ensemble of critics and the model's own feedback.
arXiv Detail & Related papers (2023-10-28T11:22:22Z) - SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
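As referenced in the Self-Rewarding Language Models entry above, both that work and Meta-Rewarding train on self-generated preference pairs with Iterative DPO. Below is a minimal sketch of the DPO loss for a single preference pair; the scalar log-probability inputs and the toy numbers are illustrative assumptions rather than any paper's exact implementation.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen / rejected responses
    under the policy being trained and under the frozen reference model.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) via the identity log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

if __name__ == "__main__":
    # Toy numbers: the policy already prefers the chosen response slightly.
    print(round(dpo_loss(-12.0, -14.0, -13.0, -13.5), 4))
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to the chosen response than the frozen reference model does, without training an explicit reward model; in the Meta-Rewarding setup the same update is applied to both the actor pairs and the judge pairs.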