SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to
RLHF
- URL: http://arxiv.org/abs/2310.05344v1
- Date: Mon, 9 Oct 2023 02:11:21 GMT
- Title: SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to
RLHF
- Authors: Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii
Kuchaiev
- Abstract summary: We propose SteerLM, a supervised fine-tuning method that empowers end-users to control responses during inference.
SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable AI capable of generating helpful and high-quality responses.
- Score: 19.43122743768123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model alignment with human preferences is an essential step in making Large
Language Models (LLMs) helpful and consistent with human values. It typically
consists of supervised fine-tuning (SFT) and reinforcement learning from human
feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from
a complex training setup and its tendency to align the model with implicit
values that end users cannot control at run-time. Moreover, reward models in the
RLHF stage commonly rely on single-dimensional feedback as opposed to explicit,
multifaceted signals that indicate attributes such as helpfulness, humor, and
toxicity. To address these limitations, we propose SteerLM, a supervised
fine-tuning method that empowers end-users to control responses during
inference. SteerLM conditions responses to conform to an explicitly defined
multi-dimensional set of attributes, thereby empowering a steerable AI capable
of generating helpful and high-quality responses while maintaining
customizability. Experiments show that SteerLM trained on open source datasets
generates responses that are preferred by human and automatic evaluators to
many state-of-the-art baselines trained with RLHF while being much easier to
train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
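
To make the attribute-conditioning idea concrete, the sketch below shows what
inference-time steering could look like with the released checkpoint via Hugging
Face transformers. The attribute names, value scale, and prompt template are
illustrative assumptions, not the official SteerLM format; the exact template the
checkpoint expects is documented on the model card linked above.

```python
# Minimal sketch of attribute-conditioned inference in the spirit of SteerLM.
# The attribute serialization and prompt template below are assumptions for
# illustration only; see https://huggingface.co/nvidia/SteerLM-llama2-13B for
# the exact format the released model was trained with.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/SteerLM-llama2-13B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def steer(prompt: str, attributes: dict) -> str:
    """Prepend the desired attribute values so the model conditions on them."""
    # Hypothetical serialization, e.g. "helpfulness:4,humor:0,toxicity:0"
    attr_str = ",".join(f"{k}:{v}" for k, v in attributes.items())
    conditioned = f"<attributes>{attr_str}</attributes>\nUser: {prompt}\nAssistant:"
    inputs = tokenizer(conditioned, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Same question, two different steering configurations chosen at inference time.
print(steer("Explain RLHF in two sentences.", {"helpfulness": 4, "humor": 0}))
print(steer("Explain RLHF in two sentences.", {"helpfulness": 4, "humor": 4}))
```

The key point is that the attribute values are ordinary text in the prompt, so
changing the desired behavior requires no retraining, only a different
conditioning string.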
Related papers
- Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision [34.594109869213014]
We simulate unreliable demonstrations and comparison feedback using small language models and humans.
We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT.
Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback.
arXiv Detail & Related papers (2025-01-14T06:54:17Z)
- Language Models Learn to Mislead Humans via RLHF [100.95201965748343]
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex.
We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers.
Our results highlight an important failure mode of RLHF and call for more research on assisting humans in aligning LMs.
arXiv Detail & Related papers (2024-09-19T14:50:34Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
Online iterative RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback [86.87638927637005]
ChatGLM is a free-to-use AI service powered by large language models (LLMs).
We present the ChatGLM-RLHF pipeline, designed to enhance ChatGLM's alignment with human preferences.
arXiv Detail & Related papers (2024-04-01T05:39:36Z)
- TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
The development of Large Language Models often confronts challenges stemming from heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
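
For reference, the closed-form result reduces preference learning to a simple
classification-style loss over chosen/rejected response pairs. A minimal PyTorch
sketch of that objective (with beta as the strength of the implicit KL
regularization toward the reference model) might look like:

```python
# Minimal sketch of the DPO objective on a batch of preference pairs.
# Inputs are summed token log-probabilities of the chosen (y_w) and rejected
# (y_l) responses under the policy being trained and under a frozen reference
# model; beta controls how far the policy may drift from the reference.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): push the chosen response's implicit reward
    # above the rejected response's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```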
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears [69.68672043223249]
InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
We propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via the logarithm of their conditional probabilities.
We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling.
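
A rough sketch of the ranking objective described there, assuming
length-normalized log-likelihood scores and a pairwise zero-margin hinge (the
additional cross-entropy term on the best response is omitted), could be:

```python
# Rough sketch of an RRHF-style ranking loss: each candidate response is scored
# by its length-normalized log-likelihood under the model, and candidates that
# the reward signal ranked lower are penalized for out-scoring higher-ranked
# ones. The SFT cross-entropy term on the best response is omitted here.
import torch

def rrhf_rank_loss(seq_logps, seq_lengths, reward_scores):
    # seq_logps: summed token log-probs per candidate, shape [num_candidates]
    # seq_lengths: token counts per candidate, shape [num_candidates]
    # reward_scores: external preference scores, higher = better
    p = seq_logps / seq_lengths  # length-normalized scores
    loss = torch.zeros((), device=p.device)
    for i in range(len(p)):
        for j in range(len(p)):
            if reward_scores[i] < reward_scores[j]:
                # penalize when a worse response scores above a better one
                loss = loss + torch.clamp(p[i] - p[j], min=0.0)
    return loss
```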
arXiv Detail & Related papers (2023-04-11T15:53:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.