Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes
- URL: http://arxiv.org/abs/2412.13998v1
- Date: Wed, 18 Dec 2024 16:14:59 GMT
- Title: Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes
- Authors: Katarzyna Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar
- Abstract summary: Large language models (LLMs) are increasingly embedded in everyday applications.
Ensuring their alignment with the diverse preferences of individual users has become a critical challenge.
We present a novel framework for few-shot steerable alignment.
- Score: 50.544186914115045
- License:
- Abstract: As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
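As a rough sketch of the modelling idea in the abstract (the exact parameterisation is defined in the paper and the linked repository; the reward function, encoder, and context-set notation below are illustrative assumptions), the Bradley-Terry-Luce likelihood can be conditioned on a latent variable that captures a user's unobserved preference factors and is inferred from a small context set of that user's past choices:

\[
p\big(y_w \succ y_l \mid x, z_u\big) = \sigma\big(r_\theta(x, y_w, z_u) - r_\theta(x, y_l, z_u)\big),
\qquad
z_u = \mathrm{Enc}_\phi\big(\{(x_i, y_i^{w}, y_i^{l})\}_{i=1}^{k}\big),
\]

where the context set contains only a few (k) observed comparisons from user u, so new users can be accommodated at inference time without retraining, in the spirit of neural processes.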
Related papers
- Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z)
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from large text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
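As a minimal, generic illustration of what "preferences specified at inference time" can look like in practice (this is not MetaAlign's training or prompting recipe, which is described in the cited paper), a preference statement can simply be injected into the prompt:

```python
# Illustrative only: a generic way to pass an explicit preference to a model at
# inference time. MetaAlign's actual setup may differ.
preference = "Prefer concise, bullet-point answers over long explanations."
question = "How do I set up a Python virtual environment?"

prompt = (
    f"Preference: {preference}\n"
    f"Question: {question}\n"
    "Answer in a way that follows the stated preference."
)
# `prompt` would then be passed to the aligned model's generation call.
```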
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
- Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align".
We develop a preference dataset containing 3K+ multi-turn conversations organized in tree structures.
For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z)
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
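A schematic of the latent-variable idea behind such pluralistic RLHF methods, not the paper's implementation: a user's labelled comparisons are encoded into a latent variable, and the reward model is conditioned on it. Class names, architecture sizes, and the mean-pooling aggregation below are illustrative assumptions.

```python
# Sketch of a latent-conditioned reward model in the spirit of variational
# preference learning. Illustrative only; not the cited paper's code.
import torch
import torch.nn as nn

class LatentConditionedReward(nn.Module):
    def __init__(self, feat_dim: int, latent_dim: int = 16):
        super().__init__()
        # Encoder: maps features of (preferred, rejected) pairs to latent parameters.
        self.encoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 2 * latent_dim)
        )
        # Reward head: scores a response embedding given the user latent.
        self.reward = nn.Sequential(
            nn.Linear(feat_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def infer_user(self, pref_feats, rej_feats):
        # Aggregate the user's context comparisons (mean pooling) into q(z | context),
        # then draw a reparameterised sample.
        stats = self.encoder(torch.cat([pref_feats, rej_feats], dim=-1)).mean(dim=0)
        mu, log_sigma = stats.chunk(2, dim=-1)
        return mu + log_sigma.exp() * torch.randn_like(mu)

    def forward(self, resp_feats, z):
        z = z.expand(resp_feats.shape[0], -1)
        return self.reward(torch.cat([resp_feats, z], dim=-1)).squeeze(-1)

# Training would combine a Bradley-Terry-style loss on held-out comparisons, e.g.
# -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean(),
# with a regulariser on q(z | context), as in standard variational formulations.
```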
arXiv Detail & Related papers (2024-08-19T15:18:30Z)
- Pareto-Optimal Learning from Preferences with Hidden Context [17.590330740964266]
We propose POPL, which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs.
Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies.
We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness.
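As a generic illustration of the Pareto framing (not POPL's algorithm, which is described in the cited paper): given each candidate policy's estimated reward under every group's reward function, one keeps only candidates that no other candidate dominates on all groups at once.

```python
# Generic Pareto-front filter over per-group reward vectors. Illustrative only.
from typing import Dict, List

def pareto_front(group_rewards: Dict[str, List[float]]) -> List[str]:
    """group_rewards maps a candidate name to its per-group reward vector."""
    names = list(group_rewards)
    front = []
    for a in names:
        dominated = any(
            all(gb >= ga for ga, gb in zip(group_rewards[a], group_rewards[b]))
            and any(gb > ga for ga, gb in zip(group_rewards[a], group_rewards[b]))
            for b in names if b != a
        )
        if not dominated:
            front.append(a)
    return front

# Example: policy_c is dominated by policy_a on both groups and is discarded.
print(pareto_front({
    "policy_a": [0.9, 0.4],
    "policy_b": [0.5, 0.8],
    "policy_c": [0.7, 0.3],
}))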
arXiv Detail & Related papers (2024-06-21T18:57:38Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned from Zephyr-7B-SFT and Llama-3-8B-Instruct, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
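Schematically, and not SELM's exact derivation (see the cited paper for the actual bilevel objective), an "optimistically biased" reward-fitting objective adds a bonus for responses the fitted reward deems potentially high-reward, encouraging exploration of out-of-distribution responses:

\[
\min_{r}\; \mathcal{L}_{\mathrm{fit}}(r) \;-\; \alpha\, \mathbb{E}_{x}\Big[\max_{y} r(x, y)\Big],
\]

where \(\mathcal{L}_{\mathrm{fit}}\) stands for a standard preference-fitting loss and \(\alpha > 0\) controls the strength of the optimism term; both are illustrative placeholders here.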
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs fine-tuned with MRPO generalize better across various preference datasets, regardless of whether the data are scarce or abundant.
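For orientation, the standard single-reference DPO loss is shown below together with one natural way to aggregate several reference models; the weighted log-probability combination is an illustrative assumption, not necessarily MRPO's actual closed form:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big],
\qquad
\log \pi_{\mathrm{ref}}(y \mid x) \;\longrightarrow\; \sum_{k=1}^{K} \lambda_k \log \pi_{\mathrm{ref}_k}(y \mid x),\quad \lambda_k \ge 0,\; \textstyle\sum_k \lambda_k = 1.
\]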
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
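For reference, canonical (simplified) objectives for the three families just mentioned; the cited paper compares them in detail, and the forms below are textbook instances rather than its specific formulations:

\[
\begin{aligned}
\text{supervised learning (SFT on preferred responses):}\quad & \mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y_w)}\big[\log \pi_\theta(y_w \mid x)\big],\\
\text{on-policy RL (RLHF):}\quad & \max_\theta\; \mathbb{E}_{x,\; y \sim \pi_\theta}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),\\
\text{contrastive (e.g. DPO):}\quad & \text{the } \mathcal{L}_{\mathrm{DPO}} \text{ loss shown above.}
\end{aligned}
\]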
arXiv Detail & Related papers (2024-04-22T17:20:18Z)