RAIN: Your Language Models Can Align Themselves without Finetuning
- URL: http://arxiv.org/abs/2309.07124v2
- Date: Mon, 9 Oct 2023 03:34:01 GMT
- Title: RAIN: Your Language Models Can Align Themselves without Finetuning
- Authors: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
- Abstract summary: Large language models (LLMs) often demonstrate inconsistencies with human preferences.
We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting.
We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
- Score: 25.703729145091483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often demonstrate inconsistencies with human
preferences. Previous research typically gathered human preference data and
then aligned the pre-trained models using reinforcement learning or instruction
tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without
requiring alignment data is more appealing. This work explores the potential of
the latter setting. We discover that by integrating self-evaluation and rewind
mechanisms, unaligned LLMs can directly produce responses consistent with human
preferences via self-boosting. We introduce a novel inference method,
Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to
evaluate their own generation and use the evaluation results to guide rewind
and generation for AI safety. Notably, RAIN operates without the need of extra
data for model alignment and abstains from any training, gradient computation,
or parameter updates. Experimental results evaluated by GPT-4 and humans
demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the
harmlessness rate of LLaMA 30B from 82% of vanilla inference to 97%, while
maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the
truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.
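To make the self-evaluate-and-rewind idea from the abstract concrete, here is a minimal, hypothetical sketch rather than the paper's actual algorithm: RAIN searches over sets of tokens and aggregates self-evaluation values, whereas this sketch simply resamples a segment whenever the model's own score for the partial response falls below a threshold. The callables generate_segment and self_evaluate, the 0.8 threshold, and the end-of-sequence marker are illustrative assumptions.

```python
"""Simplified sketch of RAIN-style rewindable inference (not the paper's full search)."""
import random
from typing import Callable, List


def rain_style_decode(
    prompt: str,
    generate_segment: Callable[[str], str],      # frozen LLM proposes the next text segment
    self_evaluate: Callable[[str, str], float],  # same LLM scores (prompt, draft) in [0, 1]
    threshold: float = 0.8,
    max_segments: int = 8,
    max_rewinds_per_segment: int = 4,
) -> str:
    """Extend the response segment by segment; "rewind" (resample) a segment when the
    model's own evaluation of the partial response is too low, keeping the best attempt."""
    response: List[str] = []
    for _ in range(max_segments):
        best_segment, best_score = "", float("-inf")
        for _ in range(max_rewinds_per_segment):
            candidate = generate_segment(prompt + "".join(response))
            score = self_evaluate(prompt, "".join(response) + candidate)
            if score > best_score:
                best_segment, best_score = candidate, score
            if score >= threshold:  # good enough: keep it without further rewinds
                break
            # otherwise rewind: discard this candidate and resample the segment
        response.append(best_segment)
        if best_segment.endswith("</s>"):  # hypothetical end-of-sequence marker
            break
    return "".join(response)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a real model.
    def toy_generate(context: str) -> str:
        return random.choice([" a helpful sentence.", " a risky sentence.", "</s>"])

    def toy_evaluate(prompt: str, draft: str) -> float:
        return 0.2 if "risky" in draft else 0.95

    print(rain_style_decode("User: how do I stay safe online?\nAssistant:", toy_generate, toy_evaluate))
```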
Related papers
- Self-Evolved Reward Learning for LLMs [45.6910747154447]
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences.
We propose Self-Evolved Reward Learning (SER), a novel approach where the reward model (RM) generates additional training data to iteratively improve itself.
Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance.
arXiv Detail & Related papers (2024-11-01T07:29:03Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment [36.52424795446663]
GenARM is a test-time alignment approach that leverages the Autoregressive Reward Model.
We show that GenARM significantly outperforms prior test-time alignment baselines.
It supports real-time trade-offs between preference dimensions and caters to diverse user preferences without retraining.
arXiv Detail & Related papers (2024-10-10T17:58:24Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: it amplifies the reward signal learned during alignment training (a minimal weight-extrapolation sketch appears after this list).
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- Reformatted Alignment [27.79684742862816]
Current methods to improve data quality are either labor-intensive or prone to factual errors caused by hallucinations.
This paper introduces a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence.
Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs.
arXiv Detail & Related papers (2024-02-19T15:21:58Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, which enables LLMs to reach human-level performance without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
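As referenced in the ExPO entry above, weak-to-strong extrapolation can be sketched in a few lines of weight arithmetic. This is a hedged illustration under the assumption that the method moves past an aligned checkpoint along the direction from a weaker (e.g. SFT) checkpoint to the aligned one; the state-dict handling and the value of alpha are illustrative assumptions, not the paper's exact recipe.

```python
"""Minimal sketch of weak-to-strong weight extrapolation in the spirit of ExPO."""
from typing import Dict
import torch


def extrapolate_weights(
    weaker: Dict[str, torch.Tensor],   # e.g. the SFT model's state_dict
    aligned: Dict[str, torch.Tensor],  # e.g. the DPO/RLHF model's state_dict
    alpha: float = 0.3,                # extrapolation strength beyond the aligned model
) -> Dict[str, torch.Tensor]:
    """Return aligned + alpha * (aligned - weaker) for every shared parameter,
    i.e. step past the aligned checkpoint along the weak-to-strong direction."""
    return {
        name: w_aligned + alpha * (w_aligned - weaker[name])
        for name, w_aligned in aligned.items()
    }


if __name__ == "__main__":
    # Toy one-parameter "models" to show the arithmetic.
    weaker = {"w": torch.tensor([1.0, 1.0])}
    aligned = {"w": torch.tensor([2.0, 0.5])}
    print(extrapolate_weights(weaker, aligned, alpha=0.5))  # {'w': tensor([2.5000, 0.2500])}
```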