Align on the Fly: Adapting Chatbot Behavior to Established Norms
- URL: http://arxiv.org/abs/2312.15907v1
- Date: Tue, 26 Dec 2023 06:51:09 GMT
- Title: Align on the Fly: Adapting Chatbot Behavior to Established Norms
- Authors: Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu,
Jing Li, Jie Fu, Pengfei Liu
- Abstract summary: We propose an On-the-fly Preference Optimization (OPO) method, which performs real-time alignment in a streaming way.
Experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed OPO method.
- Score: 47.34022081652952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to align large language models with the ever-changing,
complex, and diverse human values (e.g., social norms) across time and
locations. This presents a challenge to existing alignment techniques, such as
supervised fine-tuning, which internalize values within model parameters. To
overcome this, we propose an On-the-fly Preference Optimization (OPO) method,
which performs real-time alignment in a streaming way. It employs an
external memory to store established rules for alignment, which can constrain
LLMs' behaviors without further training, allowing for convenient updates and
customization of human values. We also introduce a scalable evaluation to
assess the proposed method more effectively. Experimental results on both
human-annotated and auto-generated questions from legal and moral domains
indicate the effectiveness of the proposed OPO method. Our code and data are
released at https://github.com/GAIR-NLP/OPO.
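The abstract describes the mechanism only at a high level: established rules live in an external memory, the rules relevant to an incoming question are retrieved, and the frozen LLM is constrained to answer under them, so updating values is a data edit rather than a retraining run. Below is a minimal sketch of that retrieve-then-constrain loop, assuming a TF-IDF retriever, toy rule texts, and a simple prompt template; it is an illustration, not the authors' released implementation (see the repository above for that).
```python
# Illustrative sketch of retrieval-constrained generation in the spirit of OPO.
# The rule memory, TF-IDF retriever, and prompt template are assumptions for
# illustration, not the authors' released code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# External memory of established norms; aligning to new or local values means
# editing this list, with no further training of the LLM.
RULE_MEMORY = [
    "Legal rule: Interest charged above the statutory cap is not protected by courts.",
    "Social norm: Do not reveal personal information that was shared in confidence.",
    "Social norm: Offer your seat to elderly passengers on public transport.",
]

vectorizer = TfidfVectorizer().fit(RULE_MEMORY)
rule_matrix = vectorizer.transform(RULE_MEMORY)

def retrieve_rules(question: str, top_k: int = 2) -> list[str]:
    """Return the top-k stored rules most relevant to the incoming question."""
    sims = cosine_similarity(vectorizer.transform([question]), rule_matrix)[0]
    return [RULE_MEMORY[i] for i in sims.argsort()[::-1][:top_k]]

def build_constrained_prompt(question: str) -> str:
    """Prepend retrieved rules so a frozen LLM answers under those norms."""
    rules = "\n".join(f"- {r}" for r in retrieve_rules(question))
    return (
        "Answer the question in accordance with the following established rules:\n"
        f"{rules}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    print(build_constrained_prompt(
        "Is a loan contract with a very high annual interest rate enforceable?"
    ))
```
Because the alignment signal lives entirely in the rule memory, a pipeline of this shape can track norms that change across time and locations by swapping or editing rules, which is the property the abstract emphasizes.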
Related papers
- GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets [19.485572131953937]
We propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting.
Empirical results show GDPO can generate far more diverse responses than the baseline methods.
arXiv Detail & Related papers (2024-10-19T13:07:52Z)
- Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss.
OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
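As one concrete illustration of what "approximating NDCG with a differentiable surrogate loss" can look like, here is a smoothed-rank (ApproxNDCG-style) surrogate in PyTorch; this is a generic, well-known construction and not necessarily the exact loss proposed in the paper.
```python
import torch

def approx_ndcg_loss(scores: torch.Tensor, relevance: torch.Tensor,
                     tau: float = 1.0) -> torch.Tensor:
    """Differentiable NDCG surrogate via smoothed ranks (ApproxNDCG-style).

    scores:    (batch, list_size) model scores for candidate responses
    relevance: (batch, list_size) graded preference labels, higher = better
    """
    # Smooth rank of item i: 1 + sum_{j != i} sigmoid((s_j - s_i) / tau).
    # diff[b, i, j] = s_j - s_i; the j == i term contributes sigmoid(0) = 0.5,
    # so adding 0.5 recovers the "1 +" offset.
    diff = scores.unsqueeze(1) - scores.unsqueeze(2)
    smooth_rank = 0.5 + torch.sigmoid(diff / tau).sum(dim=2)

    gains = torch.pow(2.0, relevance) - 1.0
    dcg = (gains / torch.log2(1.0 + smooth_rank)).sum(dim=1)

    # Ideal DCG: gains sorted by true relevance at integer positions 1..L.
    ideal_gains, _ = torch.sort(gains, descending=True, dim=1)
    positions = torch.arange(1, scores.size(1) + 1,
                             device=scores.device, dtype=scores.dtype)
    idcg = (ideal_gains / torch.log2(1.0 + positions)).sum(dim=1)

    return 1.0 - (dcg / idcg.clamp(min=1e-8)).mean()  # minimize 1 - NDCG

# Toy usage: two prompts, four ranked candidate responses each.
scores = torch.randn(2, 4, requires_grad=True)
relevance = torch.tensor([[3., 2., 1., 0.], [2., 3., 0., 1.]])
approx_ndcg_loss(scores, relevance).backward()
```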
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
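The game-theoretic framing above (a constant-sum two-player game whose Nash equilibrium is approached through iterative updates) can be illustrated generically with multiplicative-weights dynamics on a small matrix game. The sketch below is that textbook no-regret construction, not the SPPO update itself.
```python
import numpy as np

def mwu_nash(payoff: np.ndarray, steps: int = 5000, eta: float = 0.05):
    """Approximate a Nash equilibrium of a two-player zero-sum matrix game.

    Both players run multiplicative-weights updates; in zero-sum games the
    time-averaged strategies converge to an equilibrium, mirroring the
    'iterative updates that provably approach Nash' idea in a tiny setting.
    """
    m, n = payoff.shape
    p = np.ones(m) / m            # row player's mixed strategy (maximizer)
    q = np.ones(n) / n            # column player's mixed strategy (minimizer)
    p_sum, q_sum = np.zeros(m), np.zeros(n)
    for _ in range(steps):
        p_sum += p
        q_sum += q
        new_p = p * np.exp(eta * (payoff @ q))     # up-weight high-payoff rows
        new_q = q * np.exp(-eta * (payoff.T @ p))  # down-weight costly columns
        p, q = new_p / new_p.sum(), new_q / new_q.sum()
    return p_sum / steps, q_sum / steps

# Matching pennies: the unique equilibrium mixes both actions with probability 0.5.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p_star, q_star = mwu_nash(A)
print(p_star, q_star)  # both approximately [0.5, 0.5]
```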
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
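For context, the sketch below shows the standard sequence-level DPO objective that token-level variants such as TDPO refine; it is not the TDPO loss itself, and the inputs are assumed to be response log-probabilities already summed over tokens.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sequence-level DPO loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model; token-level methods such
    as TDPO instead operate on the per-token terms inside these sums.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Widen the margin between chosen and rejected responses, measured
    # relative to the reference model and scaled by beta.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```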
arXiv Detail & Related papers (2024-04-18T08:49:38Z)
- Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.