CycleAlign: Iterative Distillation from Black-box LLM to White-box
Models for Better Human Alignment
- URL: http://arxiv.org/abs/2310.16271v1
- Date: Wed, 25 Oct 2023 01:05:03 GMT
- Title: CycleAlign: Iterative Distillation from Black-box LLM to White-box
Models for Better Human Alignment
- Authors: Jixiang Hong, Quan Tu, Changyu Chen, Xing Gao, Ji Zhang, Rui Yan
- Abstract summary: Language models trained on large-scale corpora often generate content that is harmful, toxic, or contrary to human preferences.
We introduce CycleAlign to distill alignment capabilities from parameter-invisible LLMs (black-box) to a parameter-visible model (white-box) in an iterative manner.
We show that CycleAlign markedly exceeds existing methods and achieves state-of-the-art performance in alignment with human values.
- Score: 25.15541878967559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models trained on large-scale corpora often generate content that is
harmful, toxic, or contrary to human preferences, making their alignment with
human values a critical concern. Reinforcement learning from human feedback
(RLHF) with algorithms like PPO is a prevalent approach for alignment but is
often complex, unstable, and resource-intensive. Recently, ranking-based
alignment methods have emerged, offering stability and effectiveness by
replacing the RL framework with supervised fine-tuning, but they are costly due
to the need for annotated data. Considering that existing large language models
(LLMs) like ChatGPT are already relatively well-aligned and cost-friendly,
researchers have begun to align language models with human preferences using
AI feedback. The common practice of unidirectionally distilling
instruction-following responses from LLMs, however, is constrained by an
inherent bottleneck. We therefore introduce CycleAlign to distill alignment
capabilities from parameter-invisible LLMs (black-box) to a parameter-visible
model (white-box) in an iterative manner. With in-context learning (ICL) as the
core of the cycle, the black-box model ranks the model-generated responses,
guided by human-crafted instructions and demonstrations of human preferences.
During the iterative interaction, the white-box model also forms its own
judgment about the responses it generates. Consequently, the ranking on which
the two models agree can be viewed as a pseudo label that dynamically updates
the in-context demonstrations and improves the preference-ranking ability of
the black-box model. Through multiple interactions, the CycleAlign framework
aligns the white-box model with the black-box model effectively and at low
cost. Empirical results show that the model fine-tuned with CycleAlign markedly
exceeds existing methods and achieves state-of-the-art performance in alignment
with human values.
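The abstract describes the cycle concretely enough to sketch in code. Below is a minimal, hypothetical sketch of one CycleAlign-style iteration, assuming a Hugging Face-style white-box causal LM; `black_box_rank`, the demonstration format, and the RRHF-style hinge loss are illustrative placeholders, not the authors' exact implementation.

# Minimal sketch of one CycleAlign-style iteration (illustrative assumptions,
# not the authors' exact code).  `black_box_rank` stands in for an API call to
# a black-box LLM (e.g. ChatGPT) that ranks candidates via in-context learning.
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokenizer, prompt, response):
    """Length-normalized log-likelihood the white-box model assigns to a response."""
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp.mean()

def cycle_align_step(white_box, tokenizer, black_box_rank, prompt, demos, k=4):
    # 1. The white-box model samples k candidate responses.
    enc = tokenizer(prompt, return_tensors="pt")
    out = white_box.generate(**enc, do_sample=True, num_return_sequences=k,
                             max_new_tokens=128)
    cands = tokenizer.batch_decode(out[:, enc["input_ids"].shape[1]:],
                                   skip_special_tokens=True)

    # 2. The black-box LLM ranks the candidates via ICL, guided by
    #    human-crafted instructions plus the current demonstrations.
    bb_rank = black_box_rank(prompt, cands, demos)  # indices, best to worst

    # 3. The white-box model also judges the candidates via its own likelihood.
    scores = torch.stack([sequence_logprob(white_box, tokenizer, prompt, c)
                          for c in cands])
    wb_rank = torch.argsort(scores, descending=True).tolist()

    # 4. Positions where the two rankings agree act as a pseudo label and are
    #    appended to the in-context demonstrations used in the next round.
    for pos, (i, j) in enumerate(zip(bb_rank, wb_rank)):
        if i == j:
            demos.append({"prompt": prompt, "response": cands[i], "rank": pos})

    # 5. A ranking loss fine-tunes the white-box model so that responses the
    #    black box prefers receive higher likelihood (RRHF-style hinge,
    #    assumed here for illustration).
    loss = torch.tensor(0.0)
    for hi in range(k):
        for lo in range(hi + 1, k):
            loss = loss + torch.relu(scores[bb_rank[lo]] - scores[bb_rank[hi]])
    return loss, demos

In this sketch, the updated demonstrations returned in step 4 are fed back into the black box's in-context prompt on the next iteration, which is what closes the cycle described in the abstract.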
Related papers
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial.
In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations.
We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-01-02T22:26:54Z) - Multi-objective Reinforcement learning from AI Feedback [0.0]
This paper presents Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF), a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF).
In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into simpler principles, such as toxicity, factuality, and sycophancy.
Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones.
arXiv Detail & Related papers (2024-06-11T14:24:00Z) - Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations.
Current approaches focus on using reinforcement learning with human feedback to improve model alignment.
We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - RecExplainer: Aligning Large Language Models for Explaining Recommendation Models [50.74181089742969]
Large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following.
This paper presents the initial exploration of using LLMs as surrogate models to explain black-box recommender models.
To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment.
arXiv Detail & Related papers (2023-11-18T03:05:43Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - ILLUME: Rationalizing Vision-Language Models through Human Interactions [18.701950647429]
We propose a tuning paradigm based on human interactions with machine-generated data.
Our ILLUME executes the following loop: Given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides feedback via preference selection.
This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intent.
arXiv Detail & Related papers (2022-08-17T11:41:43Z) - Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z) - Early Stage LM Integration Using Local and Global Log-Linear Combination [46.91755970827846]
Sequence-to-sequence models with an implicit alignment mechanism (e.g., attention) are closing the performance gap with traditional hybrid hidden Markov models (HMMs).
One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora.
We present a novel method for language model integration into implicit-alignment based sequence-to-sequence models.
arXiv Detail & Related papers (2020-05-20T13:49:55Z)
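The self-query probe summarized in the entry "Predicting the Performance of Black-box LLMs through Self-Queries" above can also be sketched briefly. The follow-up prompts, the `query_llm` helper, and the logistic-regression probe below are illustrative assumptions, not that paper's exact protocol.

# Sketch of a self-query probe for a black-box LLM (illustrative assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = [  # hypothetical follow-up prompts
    "Are you confident in your previous answer? Answer yes or no.",
    "Could your previous answer be wrong? Answer yes or no.",
]

def self_query_features(query_llm, question, answer):
    """Black-box features: probability of 'yes' for each follow-up prompt.

    `query_llm(messages, options)` is a placeholder for an API call that
    returns a probability distribution over the given answer options.
    """
    feats = []
    for follow_up in FOLLOW_UPS:
        probs = query_llm(
            [("user", question), ("assistant", answer), ("user", follow_up)],
            options=["yes", "no"],
        )
        feats.append(probs["yes"])
    return np.array(feats)

def fit_probe(feature_matrix, correctness_labels):
    """Linear probe mapping self-query features to answer correctness."""
    probe = LogisticRegression()
    probe.fit(feature_matrix, correctness_labels)
    return probe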