RAIN: Your Language Models Can Align Themselves without Finetuning
- URL: http://arxiv.org/abs/2309.07124v2
- Date: Mon, 9 Oct 2023 03:34:01 GMT
- Title: RAIN: Your Language Models Can Align Themselves without Finetuning
- Authors: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
- Abstract summary: Large language models (LLMs) often demonstrate inconsistencies with human preferences.
We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting.
We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
- Score: 25.703729145091483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) often demonstrate inconsistencies with human
preferences. Previous research typically gathered human preference data and
then aligned the pre-trained models using reinforcement learning or instruction
tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without
requiring alignment data is more appealing. This work explores the potential of
the latter setting. We discover that by integrating self-evaluation and rewind
mechanisms, unaligned LLMs can directly produce responses consistent with human
preferences via self-boosting. We introduce a novel inference method,
Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to
evaluate their own generation and use the evaluation results to guide rewind
and generation for AI safety. Notably, RAIN operates without the need for extra
data for model alignment and abstains from any training, gradient computation,
or parameter updates. Experimental results evaluated by GPT-4 and humans
demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the
harmlessness rate of LLaMA 30B from 82% of vanilla inference to 97%, while
maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the
truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.
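To make the rewind-and-self-evaluate idea concrete, below is a minimal Python sketch of a decoding loop in that spirit. It is not the authors' implementation (RAIN actually searches over candidate token sets with a tree structure and updated scores); the model interface used here (model.sample, model.score, model.is_finished) and all thresholds are hypothetical stand-ins for any frozen autoregressive LM.

```python
# Minimal, illustrative sketch of a rewindable, self-evaluating decoding loop.
# NOT the paper's implementation: RAIN searches over candidate token sets with a
# tree structure and score updates; this only shows the rewind idea. The model
# methods below (sample, score, is_finished) are hypothetical stand-ins for a
# frozen autoregressive LM interface.

def generate_segment(model, prefix, segment_len=16):
    """Hypothetical: sample `segment_len` additional tokens given `prefix`."""
    return model.sample(prefix, max_new_tokens=segment_len)


def self_evaluate(model, prompt, partial_response):
    """Hypothetical: prompt the same LM to judge its own partial response
    (e.g. for harmlessness) and return a score in [0, 1]."""
    return model.score(prompt, partial_response)


def rewindable_decode(model, prompt, max_segments=8, threshold=0.5, max_rewinds=4):
    """Generate the response segment by segment; if the model rates a segment
    poorly, rewind (discard it) and resample instead of committing to it."""
    response = ""
    for _ in range(max_segments):
        best_candidate, best_score = None, float("-inf")
        for _ in range(max_rewinds):
            candidate = generate_segment(model, prompt + response)
            score = self_evaluate(model, prompt, response + candidate)
            if score > best_score:
                best_candidate, best_score = candidate, score
            if score >= threshold:  # judged acceptable: commit without further rewinds
                break
            # otherwise: rewind, i.e. discard `candidate` and try a new continuation
        response += best_candidate
        if model.is_finished(response):  # hypothetical end-of-sequence check
            break
    return response
```

The property this sketch tries to mirror is that the same frozen model both proposes and judges continuations, and low-scoring partial generations are simply discarded (rewound) and resampled, with no training, gradient computation, or parameter updates.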
Related papers
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment [36.52424795446663]
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences.
Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining.
We introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model.
arXiv Detail & Related papers (2024-10-10T17:58:24Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality in the generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when used to fine-tune the Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Reformatted Alignment [27.79684742862816]
Current methods to improve data quality are either labor-intensive or prone to factual errors caused by hallucinations.
This paper introduces a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence.
Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs.
arXiv Detail & Related papers (2024-02-19T15:21:58Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.