Related papers: Teaching Language Models to Self-Improve by Learning from Language Feedback

Teaching Language Models to Self-Improve by Learning from Language Feedback

URL: http://arxiv.org/abs/2406.07168v1
Date: Tue, 11 Jun 2024 11:20:05 GMT
Title: Teaching Language Models to Self-Improve by Learning from Language Feedback
Authors: Chi Hu, Yimin Hu, Hang Cao, Tong Xiao, Jingbo Zhu,
Abstract summary: We present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment. SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model. SRT further optimize the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement.
Score: 40.649677201161744
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Aligning Large Language Models (LLMs) with human intentions and values is crucial yet challenging. Current methods primarily rely on human preferences, which are costly and insufficient in capturing nuanced feedback expressed in natural language. In this paper, we present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment, thereby reducing reliance on human annotations. SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model (e.g., GPT-4-Turbo). This process enables the base model to self-evaluate and improve its outputs, facilitating continuous learning. SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement. Our empirical evaluations demonstrate that SRT significantly outperforms strong baselines across diverse tasks and model sizes. When applied to a 70B parameter model, SRT increases the win rate from 9.6\% to 25.8\% on the AlpacaEval 2.0 benchmark, surpassing well-established systems such as GPT-4-0314, Claude 2, and Gemini. Our analysis highlights the crucial role of language feedback in the success of SRT, suggesting potential for further exploration in this direction.

Related papers

Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT) We propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z)
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO [0.0]
Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring visual spatial reasoning. We introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation.
arXiv Detail & Related papers (2025-02-20T16:05:18Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences. We propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only. Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of Pre-trained Language Models with Proximal Policy Optimization [18.75866961339424]
ChatGPT has highlighted the potential of reinforcement learning from human feedback. To reduce labor costs, we propose a self-supervised text ranking approach.
arXiv Detail & Related papers (2024-02-28T12:24:07Z)
A Critical Evaluation of AI Feedback for Aligning Large Language Models [60.42291111149438]
We show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
arXiv Detail & Related papers (2024-02-19T18:53:54Z)
Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation [29.6763730290473]
Reinforcement learning can align language models with non-differentiable reward signals, such as human preferences. This paper introduces a novel framework that utilizes the critique capability of Large Language Models to produce intermediate-step rewards.
arXiv Detail & Related papers (2024-01-14T22:05:11Z)
Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.