SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to
RLHF
- URL: http://arxiv.org/abs/2310.05344v1
- Date: Mon, 9 Oct 2023 02:11:21 GMT
- Title: SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to
RLHF
- Authors: Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii
Kuchaiev
- Abstract summary: We propose SteerLM, a supervised fine-tuning method that empowers end-users to control responses during inference.
SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable AI capable of generating helpful and high-quality responses.
- Score: 19.43122743768123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model alignment with human preferences is an essential step in making Large
Language Models (LLMs) helpful and consistent with human values. It typically
consists of supervised fine-tuning (SFT) and reinforcement learning from human
feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from
a complex training setup and its tendency to align the model with implicit
values that end users cannot control at run-time. Moreover, reward models in the
RLHF stage commonly rely on single-dimensional feedback as opposed to explicit,
multifaceted signals that indicate attributes such as helpfulness, humor, and
toxicity. To address these limitations, we propose SteerLM, a supervised
fine-tuning method that empowers end-users to control responses during
inference. SteerLM conditions responses to conform to an explicitly defined
multi-dimensional set of attributes, thereby empowering a steerable AI capable
of generating helpful and high-quality responses while maintaining
customizability. Experiments show that SteerLM trained on open source datasets
generates responses that are preferred by human and automatic evaluators to
many state-of-the-art baselines trained with RLHF while being much easier to
train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
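
To make the attribute-conditioning idea concrete, the sketch below shows what
inference-time steering could look like with the released checkpoint via Hugging
Face transformers. The attribute names, value scale, and prompt template are
illustrative assumptions, not the official SteerLM format; the exact template the
checkpoint expects is documented on the model card linked above.

```python
# Minimal sketch of attribute-conditioned inference in the spirit of SteerLM.
# The attribute serialization and prompt template below are assumptions for
# illustration only; see https://huggingface.co/nvidia/SteerLM-llama2-13B for
# the exact format the released model was trained with.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/SteerLM-llama2-13B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def steer(prompt: str, attributes: dict) -> str:
    """Prepend the desired attribute values so the model conditions on them."""
    # Hypothetical serialization, e.g. "helpfulness:4,humor:0,toxicity:0"
    attr_str = ",".join(f"{k}:{v}" for k, v in attributes.items())
    conditioned = f"<attributes>{attr_str}</attributes>\nUser: {prompt}\nAssistant:"
    inputs = tokenizer(conditioned, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Same question, two different steering configurations chosen at inference time.
print(steer("Explain RLHF in two sentences.", {"helpfulness": 4, "humor": 0}))
print(steer("Explain RLHF in two sentences.", {"helpfulness": 4, "humor": 4}))
```

The key point is that the attribute values are ordinary text in the prompt, so
changing the desired behavior requires no retraining, only a different
conditioning string.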
Related papers
- Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision [34.594109869213014]
We simulate unreliable demonstrations and comparison feedback using small language models and humans.
We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT.
Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback.
arXiv Detail & Related papers (2025-01-14T06:54:17Z)
- Language Models Learn to Mislead Humans via RLHF [100.95201965748343]
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex.
We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers.
Our results highlight an important failure mode of RLHF and call for more research on assisting humans in aligning LMs.
arXiv Detail & Related papers (2024-09-19T14:50:34Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
Online iterative RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback [86.87638927637005]
ChatGLM is a free-to-use AI service powered by large language models (LLMs).
We present the ChatGLM-RLHF pipeline, designed to enhance ChatGLM's alignment with human preferences.
arXiv Detail & Related papers (2024-04-01T05:39:36Z)
- TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
The development of Large Language Models often confronts challenges stemming from heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
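
For reference, the closed-form result reduces preference learning to a simple
classification-style loss over chosen/rejected response pairs. A minimal PyTorch
sketch of that objective (with beta as the strength of the implicit KL
regularization toward the reference model) might look like:

```python
# Minimal sketch of the DPO objective on a batch of preference pairs.
# Inputs are summed token log-probabilities of the chosen (y_w) and rejected
# (y_l) responses under the policy being trained and under a frozen reference
# model; beta controls how far the policy may drift from the reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): push the chosen response's implicit reward
    # above the rejected response's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```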
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via the logarithm of their conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
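
A rough sketch of the ranking objective described there, assuming
length-normalized log-likelihood scores and a pairwise zero-margin hinge (the
additional cross-entropy term on the best response is omitted), could be:

```python
# Rough sketch of an RRHF-style ranking loss: each candidate response is scored
# by its length-normalized log-likelihood under the model, and candidates that
# the reward signal ranked lower are penalized for out-scoring higher-ranked
# ones. The SFT cross-entropy term on the best response is omitted here.
import torch

def rrhf_rank_loss(seq_logps, seq_lengths, reward_scores):
    # seq_logps: summed token log-probs per candidate, shape [num_candidates]
    # seq_lengths: token counts per candidate, shape [num_candidates]
    # reward_scores: external preference scores, higher = better
    p = seq_logps / seq_lengths  # length-normalized scores
    loss = torch.zeros((), device=p.device)
    for i in range(len(p)):
        for j in range(len(p)):
            if reward_scores[i] < reward_scores[j]:
                # penalize when a worse response scores above a better one
                loss = loss + torch.clamp(p[i] - p[j], min=0.0)
    return loss
```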
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.