Personalized Soups: Personalized Large Language Model Alignment via
Post-hoc Parameter Merging
- URL: http://arxiv.org/abs/2310.11564v1
- Date: Tue, 17 Oct 2023 20:22:13 GMT
- Title: Personalized Soups: Personalized Large Language Model Alignment via
Post-hoc Parameter Merging
- Authors: Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel,
Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu
- Abstract summary: We study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem.
LLMs are aligned to multiple preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.
We show that we can achieve personalized alignment by decomposing preferences into multiple dimensions.
- Score: 148.77027765872006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Reinforcement Learning from Human Feedback (RLHF) aligns Large Language
Models (LLMs) with general, aggregate human preferences, it is suboptimal for
learning diverse, individual perspectives. In this work, we study the
Reinforcement Learning from Personalized Human Feedback (RLPHF) problem,
wherein LLMs are
aligned to multiple (sometimes conflicting) preferences by modeling alignment
as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong
single-objective baselines, we show that we can achieve personalized alignment
by decomposing preferences into multiple dimensions. These dimensions are
defined based on personalizations that are declared as desirable by the user.
In this work, we show that they can be efficiently trained independently in a
distributed manner and combined effectively post-hoc through parameter merging.
The code is available at https://github.com/joeljang/RLPHF.
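To make the merging step concrete, here is a minimal sketch of post-hoc parameter merging ("personalized soup"), assuming one independently fine-tuned expert model per declared preference dimension; the weighted average and the `merge_experts` helper are illustrative, not the paper's exact recipe.
```python
# Minimal sketch of post-hoc parameter merging ("personalized soup").
# Assumes one expert checkpoint per preference dimension; the weighted
# average below is an illustration, not the paper's exact recipe.
from typing import Dict, List
import torch

def merge_experts(
    expert_state_dicts: List[Dict[str, torch.Tensor]],
    weights: List[float],
) -> Dict[str, torch.Tensor]:
    """Return a weighted average of expert parameters."""
    assert expert_state_dicts and len(expert_state_dicts) == len(weights)
    total = sum(weights)
    return {
        name: sum(
            (w / total) * sd[name].float()
            for sd, w in zip(expert_state_dicts, weights)
        )
        for name in expert_state_dicts[0]
    }

# Hypothetical usage: a user who values conciseness twice as much as humor.
# experts = [torch.load(p) for p in ("concise.pt", "humorous.pt")]
# model.load_state_dict(merge_experts(experts, weights=[2.0, 1.0]))
```
Because the experts are trained independently, a new preference combination costs only an average over checkpoints at load time, not another round of RLHF.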
Related papers
- Orchestrating LLMs with Different Personalizations [28.344891363780576]
This paper presents a novel approach to aligning large language models (LLMs) with individual human preferences.
Given stated preferences along multiple dimensions, such as helpfulness, conciseness, or humor, the goal is to create, without re-training, an LLM that best adheres to this specification.
Starting from specialized expert LLMs, each trained for one particular preference dimension, we propose a black-box method that merges their outputs on a per-token level.
arXiv Detail & Related papers (2024-07-04T22:55:02Z)
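A hedged sketch of what per-token output merging could look like, assuming HuggingFace-style causal LMs; the mixture weighting and greedy decoding are assumptions, not the paper's published algorithm.
```python
# Illustrative per-token merge of black-box experts: mix each expert's
# next-token distribution and decode from the mixture. Assumes
# HuggingFace-style causal LMs; weighting and greedy decoding are
# assumptions, not the paper's published algorithm.
import torch

@torch.no_grad()
def merged_decode(experts, tokenizer, prompt, weights, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    w = torch.tensor(weights, dtype=torch.float) / sum(weights)
    for _ in range(max_new_tokens):
        # Stack each expert's next-token distribution: (n_experts, 1, vocab).
        probs = torch.stack(
            [m(ids).logits[:, -1].softmax(dim=-1) for m in experts]
        )
        mixture = (w.view(-1, 1, 1) * probs).sum(dim=0)  # (1, vocab)
        next_id = mixture.argmax(dim=-1, keepdim=True)   # greedy on mixture
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```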
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in a small amount of seed data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low-quality pairs within the generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
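The summary does not specify the algorithm, but a common noise-aware pattern is label smoothing on a DPO-style objective, shown below as a generic illustration rather than this paper's exact method.
```python
# Generic noise-robust preference loss: label-smoothed DPO. The smoothing
# term hedges against flipped labels in self-generated pairs. This is a
# standard pattern, not necessarily this paper's exact algorithm.
import torch.nn.functional as F

def noise_aware_dpo_loss(
    logp_chosen, logp_rejected,          # log-probs under the policy
    ref_logp_chosen, ref_logp_rejected,  # log-probs under the reference
    beta: float = 0.1,
    label_noise: float = 0.1,            # estimated fraction of flipped labels
):
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # With probability `label_noise`, treat the pair as mislabeled.
    return (
        -(1 - label_noise) * F.logsigmoid(margin)
        - label_noise * F.logsigmoid(-margin)
    ).mean()
```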
- SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling [34.32744849352087]
We propose a method that sequentially fine-tunes large language models to align with human preferences.
We theoretically derive the closed-form optimal SPO policy and loss function.
Empirical results on LLMs of different sizes and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences.
arXiv Detail & Related papers (2024-05-21T12:47:17Z)
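A schematic of the sequential pattern, under the assumption that each stage performs a DPO-style implicit-reward update against the previous stage's policy as reference; `dpo_update` is a placeholder, and the paper's closed-form loss differs in detail.
```python
# Schematic of sequential multi-dimensional alignment: each stage starts
# from the previous stage's policy and uses it as the frozen reference,
# so the implicit reward of earlier dimensions is preserved. `dpo_update`
# stands in for one round of preference optimization; it is a
# placeholder, not SPO's exact closed-form loss.
import copy

def sequential_align(base_policy, preference_datasets, dpo_update):
    policy = base_policy
    for dim_data in preference_datasets:   # e.g. [helpfulness, humor, ...]
        reference = copy.deepcopy(policy)  # freeze the previous stage
        policy = dpo_update(policy, reference, dim_data)
    return policy
```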
- PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning [20.39414674098941]
Large Language Models (LLMs) are trained on massive text corpora, which encode diverse personality traits.
We formalize the persona elicitation task, aiming to customize LLM behaviors to align with a target persona.
We present Persona In-Context Learning (PICLe), a novel persona elicitation framework grounded in Bayesian inference.
arXiv Detail & Related papers (2024-05-03T22:17:22Z)
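A sketch of how Bayesian persona elicitation via in-context learning might select demonstrations; the likelihood-ratio criterion and the `log_prob` helper are assumptions read off the abstract, not PICLe's exact procedure.
```python
# Sketch of persona in-context example selection: rank candidate persona
# statements by how much more likely a persona-conditioned model finds
# them than the base model, then prompt with the top-k as demonstrations.
# The likelihood-ratio criterion and `log_prob` helper are assumptions.
def select_persona_examples(base_lm, persona_lm, candidates, log_prob, k=5):
    """candidates: list[str]; log_prob(model, text) -> float total log-prob."""
    scored = sorted(
        candidates,
        key=lambda c: log_prob(persona_lm, c) - log_prob(base_lm, c),
        reverse=True,
    )
    return scored[:k]  # use as in-context demonstrations for the persona
```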
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during training.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
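A sketch of the cross-prompt contrast idea: each chosen response is contrasted against rejected responses from its own and related prompts, with cross-prompt pairs weighted by prompt similarity. The cosine-similarity softmax weighting is an assumption, not RPO's published contrast matrix.
```python
# Sketch of relative preference optimization: contrast a chosen response
# not only with its own rejected response but with rejected responses
# from semantically related prompts, weighted by prompt similarity.
# The cosine-similarity softmax weighting is an assumption.
import torch
import torch.nn.functional as F

def rpo_loss(margins, prompt_embs, tau: float = 1.0):
    """
    margins: (B, B) tensor; margins[i, j] = implicit-reward gap between
        chosen response i and rejected response j (DPO-style, incl. beta).
    prompt_embs: (B, D) prompt embeddings used to weight cross-prompt pairs.
    """
    sim = F.cosine_similarity(
        prompt_embs.unsqueeze(1), prompt_embs.unsqueeze(0), dim=-1
    )                                            # (B, B) prompt similarity
    weights = torch.softmax(sim / tau, dim=-1)   # related prompts count more
    return -(weights * F.logsigmoid(margins)).sum(dim=1).mean()
```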
- Personalized Language Modeling from Personalized Human Feedback [49.344833339240566]
Reinforcement Learning from Human Feedback (RLHF) is commonly used to fine-tune large language models to better align with human preferences, but it aggregates feedback across users.
In this work, we aim to address this problem by developing methods for building personalized language models.
arXiv Detail & Related papers (2024-02-06T04:18:58Z)
- Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [78.50294936259026]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)
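A simplified sketch of a MODPO-style objective: a DPO update on one objective's preference data, offset by a margin from the remaining objectives' reward models under user-chosen weightings; scaling details are simplified relative to the paper.
```python
# Simplified sketch of a MODPO-style objective: a DPO update on one
# objective's preference data, offset by a margin from the other
# objectives' reward models with user-chosen weights. Scaling details
# are simplified relative to the paper.
import torch
import torch.nn.functional as F

def modpo_loss(
    logp_chosen, logp_rejected,                    # policy log-probs
    ref_logp_chosen, ref_logp_rejected,            # frozen reference log-probs
    other_rewards_chosen, other_rewards_rejected,  # (B, K-1) from other RMs
    w_main: float, w_other: torch.Tensor,          # objective weightings
    beta: float = 0.1,
):
    implicit = (logp_chosen - ref_logp_chosen) - (
        logp_rejected - ref_logp_rejected
    )
    margin = (w_other * (other_rewards_chosen - other_rewards_rejected)).sum(-1)
    return -F.logsigmoid((beta * implicit - margin) / w_main).mean()
```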
This list is automatically generated from the titles and abstracts of the papers on this site.