RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
- URL: http://arxiv.org/abs/2307.12950v3
- Date: Sat, 16 Mar 2024 04:22:09 GMT
- Title: RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
- Authors: Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian
- Abstract summary: Reinforcement Learning from Contrastive Distillation (RLCD) is a method for aligning language models without using human feedback.
RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them.
We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning.
- Score: 121.45689748315125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Reinforcement Learning from Contrastive Distillation (RLCD), a method for aligning language models to follow principles expressed in natural language (e.g., to be more harmless) without using human feedback. RLCD creates preference pairs from two contrasting model outputs, one using a positive prompt designed to encourage following the given principles, and one using a negative prompt designed to encourage violating them. Using two different prompts causes model outputs to be more differentiated on average, resulting in cleaner preference labels in the absence of human annotations. We then use the preference pairs to train a preference model, which is in turn used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks--harmlessness, helpfulness, and story outline generation--and when using both 7B and 30B model scales for simulating preference data.
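Concretely, the preference-data simulation step described in the abstract can be sketched as follows. This is a minimal, illustrative sketch only: the `generate` stub, the positive/negative prompt wording, and the principle text are placeholder assumptions, not the paper's exact prompts or models.

```python
# Minimal sketch of RLCD-style preference-pair simulation (illustrative only).
# `generate` stands in for sampling from an unaligned base language model;
# the prompt wording below is a placeholder, not the paper's exact prompts.

def generate(prompt: str) -> str:
    """Placeholder for sampling a continuation from the base language model."""
    return f"<model continuation conditioned on: {prompt!r}>"

def simulate_preference_pair(conversation: str, principle: str) -> dict:
    # Positive prompt encourages following the principle; negative prompt encourages violating it.
    positive_prompt = f"{conversation}\n(Give a {principle} response)\nAssistant:"
    negative_prompt = f"{conversation}\n(Give a response that ignores being {principle})\nAssistant:"

    chosen = generate(positive_prompt)    # output produced under the positive prompt
    rejected = generate(negative_prompt)  # output produced under the negative prompt

    # The pair is labeled automatically: the positively-prompted output is preferred.
    # The preference model is then trained on the unmodified conversation plus these two outputs.
    return {"prompt": conversation, "chosen": chosen, "rejected": rejected}

if __name__ == "__main__":
    pair = simulate_preference_pair("Human: How do I stay safe online?", "harmless and helpful")
    print(pair["chosen"], pair["rejected"], sep="\n")
```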
Related papers
- Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models [22.613040767122225]
We propose a Preference-Aligned Distillation (PAD) framework, which models the teacher's preference knowledge as a probability distribution over all potential preferences.
Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-02-20T05:18:23Z)
- Multi-objective Reinforcement learning from AI Feedback [0.0]
This paper presents Multi-Objective RLAIF (MORLAIF), a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF).
In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into simpler principles, such as toxicity, factuality, and sycophancy.
Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones.
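A minimal sketch of the decomposition idea described above, assuming separate per-principle preference models are already trained; the scoring stubs and the simple weighted-sum combination are illustrative assumptions, not necessarily MORLAIF's exact scalarization.

```python
# Sketch: combine principle-specific preference scores into one scalar reward
# for RL fine-tuning. The per-principle scorers are stubs, and the weighted sum
# is one possible scalarization (illustrative assumption).

from typing import Callable, Dict

PrincipleScorer = Callable[[str, str], float]  # (prompt, response) -> score

def combined_reward(
    prompt: str,
    response: str,
    scorers: Dict[str, PrincipleScorer],
    weights: Dict[str, float],
) -> float:
    # Each principle (toxicity, factuality, sycophancy, ...) has its own preference model.
    return sum(weights[name] * scorers[name](prompt, response) for name in scorers)

# Illustrative stubs standing in for trained per-principle preference models.
scorers = {
    "toxicity":   lambda p, r: 0.9,   # higher = less toxic
    "factuality": lambda p, r: 0.7,
    "sycophancy": lambda p, r: 0.8,   # higher = less sycophantic
}
weights = {"toxicity": 0.4, "factuality": 0.4, "sycophancy": 0.2}

print(combined_reward("Is the earth flat?", "No, it is an oblate spheroid.", scorers, weights))
```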
arXiv Detail & Related papers (2024-06-11T14:24:00Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have become widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment [25.15541878967559]
Language models trained on large-scale corpora often generate content that is harmful, toxic, or contrary to human preferences.
We introduce CycleAlign to distill alignment capabilities from parameter-invisible LLMs (black-box) to a parameter-visible model (white-box) in an iterative manner.
We show that CycleAlign substantially outperforms existing methods and achieves state-of-the-art performance in alignment with human values.
arXiv Detail & Related papers (2023-10-25T01:05:03Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements; then selecting the refinement that incorporates the most feedback; and finally fine-tuning the model on the chosen refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to reinforcement learning from human feedback.
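A minimal sketch of the first ILF step described above (conditioning on the input, an initial LM output, and language feedback to propose refinements); the prompt template, the LM stub, and the example texts are assumptions for illustration, not the paper's implementation.

```python
# Sketch of ILF's refinement-generation step (illustrative; prompt template and
# stub generator are assumptions, not the paper's exact setup).

from typing import List

def lm_generate(prompt: str, n: int = 3) -> List[str]:
    """Placeholder for sampling n candidate refinements from the language model."""
    return [f"<refinement {i} for: {prompt[:40]!r}...>" for i in range(n)]

def propose_refinements(task_input: str, initial_output: str, feedback: str) -> List[str]:
    # Step 1: condition the LM on the input, its initial output, and the language feedback.
    prompt = (
        f"Task: {task_input}\n"
        f"Initial answer: {initial_output}\n"
        f"Feedback: {feedback}\n"
        f"Improved answer:"
    )
    return lm_generate(prompt)

# The later ILF steps (selecting the refinement that best incorporates the feedback,
# then fine-tuning on the selected refinements) are omitted from this sketch.
candidates = propose_refinements(
    "Summarize the article in one sentence.",
    "The article talks about stuff.",
    "Too vague: name the main finding.",
)
print(candidates)
```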
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
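A minimal numpy sketch of the projection idea described above: estimated bias directions in the text-embedding space are projected out of prompt embeddings. The random stand-in embeddings and the absence of the paper's calibration step are simplifying assumptions.

```python
# Sketch: project biased directions out of text embeddings (illustrative).
# The paper uses a *calibrated* projection matrix; calibration is omitted here,
# and the bias directions below are random stand-ins for embedding differences
# of paired prompts.

import numpy as np

def debiasing_projection(bias_directions: np.ndarray) -> np.ndarray:
    """Return P = I - V (V^T V)^{-1} V^T, which removes the span of the bias directions."""
    V = bias_directions.T                      # shape (dim, num_directions)
    inv = np.linalg.pinv(V.T @ V)              # pseudo-inverse for numerical safety
    return np.eye(V.shape[0]) - V @ inv @ V.T

rng = np.random.default_rng(0)
dim = 512
bias_dirs = rng.normal(size=(2, dim))          # stand-ins for estimated bias directions
text_emb = rng.normal(size=(dim,))             # stand-in for a prompt/class-name embedding

P = debiasing_projection(bias_dirs)
debiased = P @ text_emb

# The debiased embedding has (numerically) no component along the bias directions.
print(np.allclose(bias_dirs @ debiased, 0, atol=1e-8))
```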
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning [53.92465205531759]
Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences.
We train a contrastive bi-encoder model to align stories with human critiques, building a general purpose preference model.
We further fine-tune the contrastive reward model using a prompt-learning technique to increase story generation robustness.
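A minimal PyTorch sketch of a contrastive bi-encoder of the kind described above: one encoder embeds stories, another embeds critiques, and an in-batch contrastive loss pulls matching story-critique pairs together. The toy hash-based featurizer, architecture, and InfoNCE-style loss are illustrative assumptions, not the paper's actual model or training setup.

```python
# Sketch: a contrastive bi-encoder that scores how well a story matches a critique.
# The tiny featurizer and in-batch loss are stand-ins for the paper's real encoders.

import torch
import torch.nn as nn
import torch.nn.functional as F

def featurize(texts, dim=256):
    """Toy bag-of-token-hashes featurizer standing in for a pretrained text encoder input."""
    feats = torch.zeros(len(texts), dim)
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            feats[i, hash(tok) % dim] += 1.0
    return feats

class BiEncoder(nn.Module):
    def __init__(self, in_dim=256, emb_dim=64):
        super().__init__()
        self.story_enc = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))
        self.critique_enc = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

    def score(self, story_feats, critique_feats):
        s = F.normalize(self.story_enc(story_feats), dim=-1)
        c = F.normalize(self.critique_enc(critique_feats), dim=-1)
        return s @ c.T  # cosine-similarity matrix: (num_stories, num_critiques)

stories = ["The knight kept her promise.", "The knight abandoned the quest."]
critiques = ["Protagonist should stay true to her word.", "Protagonist gives up too easily."]

model = BiEncoder()
logits = model.score(featurize(stories), featurize(critiques)) / 0.07  # temperature
labels = torch.arange(len(stories))            # i-th story matches i-th critique
loss = F.cross_entropy(logits, labels)         # in-batch contrastive (InfoNCE-style) loss
loss.backward()
print(float(loss))
```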
arXiv Detail & Related papers (2022-10-14T13:21:33Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose ABINet, an autonomous, bidirectional, and iterative network for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)