Parameter-Efficient Tuning Helps Language Model Alignment
- URL: http://arxiv.org/abs/2310.00819v1
- Date: Sun, 1 Oct 2023 23:27:14 GMT
- Title: Parameter-Efficient Tuning Helps Language Model Alignment
- Authors: Tianci Xue, Ziqi Wang, Heng Ji
- Abstract summary: Previous works mainly adopt reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) for alignment.
Controllable generation offers more flexibility with regard to data format.
Our approach, alignMEnt with parameter-Efficient Tuning (MEET), improves the quality of control tokens.
- Score: 57.27390187540737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) with human preferences is essential for
safe and useful LLMs. Previous works mainly adopt reinforcement learning from
human feedback (RLHF) and direct preference optimization (DPO) for alignment.
Nevertheless, they have certain drawbacks. For instance, they can only align
models with one preference at training time (e.g., they cannot learn to
generate concise responses when the preference data prefers detailed
responses), or they impose constraints on the data format (e.g., DPO only
supports pairwise preference data). To address these issues, prior works
incorporate controllable generation for alignment, enabling language models to
learn multiple preferences and, when asked, produce outputs reflecting
different preferences at inference time. Controllable generation also offers
more flexibility with regard to data
format (e.g., it supports pointwise preference data). Specifically, it uses
different control tokens for different preferences during training and
inference, making LLMs behave differently when required. Current controllable
generation methods either use a special token or hand-crafted prompts as
control tokens, and optimize them together with LLMs. As control tokens are
typically much lighter than LLMs, this optimization strategy may not
effectively optimize control tokens. To address this, we first use
parameter-efficient tuning (e.g., prompt tuning and low-rank adaptation) to
optimize control tokens and then fine-tune models for controllable generation,
similar to prior works. Our approach, alignMEnt with parameter-Efficient Tuning
(MEET), improves the quality of control tokens, thus consistently improving
controllable generation quality by a clear margin on two well-recognized
datasets compared with prior works.
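To make the approach described above concrete, here is a minimal sketch (not the authors' code) of giving each preference its own parameter-efficiently tuned control token via prompt tuning with the Hugging Face peft library; the base model, preference names, initialization texts, and hyperparameters are illustrative assumptions.

```python
# Sketch only: one tiny prompt-tuning adapter per preference acts as a learned
# "control token"; only these virtual-token embeddings are optimized here.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

def control_token_config(init_text: str) -> PromptTuningConfig:
    """Build a prompt-tuning config whose virtual tokens play the role of a control token."""
    return PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        prompt_tuning_init=PromptTuningInit.TEXT,
        prompt_tuning_init_text=init_text,
        num_virtual_tokens=8,
        tokenizer_name_or_path=base,
    )

# One adapter per preference; names and init texts are illustrative.
peft_model = get_peft_model(model, control_token_config("Respond concisely."),
                            adapter_name="concise")
peft_model.add_adapter("detailed", control_token_config("Respond in detail."))

# Activate the adapter that matches the requested preference for training or inference.
peft_model.set_adapter("concise")
peft_model.print_trainable_parameters()
```

In the setting described in the abstract, the model itself would then be fine-tuned for controllable generation on top of these learned control tokens.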
Related papers
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from vast text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z) - Orchestrating LLMs with Different Personalizations [28.344891363780576]
This paper presents a novel approach to aligning large language models (LLMs) with individual human preferences.
Given stated preferences along multiple dimensions, such as helpfulness, conciseness, or humor, the goal is to create, without re-training, an LLM that best adheres to this specification.
Starting from specialized expert LLMs, each trained for one particular preference dimension, we propose a black-box method that merges their outputs on a per-token level.
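A minimal sketch of the per-token merging idea described above, under the assumption that merging means mixing the experts' next-token distributions with user-chosen weights; the models and weights below are placeholders, not the paper's setup.

```python
# Sketch only: greedy decoding from a weighted mixture of two experts'
# next-token distributions. Both models must share a tokenizer/vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
experts = [AutoModelForCausalLM.from_pretrained("gpt2"),
           AutoModelForCausalLM.from_pretrained("distilgpt2")]
weights = [0.7, 0.3]  # placeholder preference weights over the two experts

ids = tok("Write a short, friendly reply:", return_tensors="pt").input_ids
for _ in range(20):  # generate 20 tokens from the merged distribution
    with torch.no_grad():
        mixed = sum(w * torch.softmax(m(ids).logits[:, -1, :], dim=-1)
                    for w, m in zip(weights, experts))
    next_id = mixed.argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0], skip_special_tokens=True))
```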
arXiv Detail & Related papers (2024-07-04T22:55:02Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
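A minimal sketch of the self-augmentation idea summarized above, assuming the policy's own sampled output serves as the rejected response and that pairs are stored in a replay buffer for off-policy reuse; this is an illustration, not the SAPO implementation.

```python
# Sketch only: the policy's own sample is treated as the "rejected" response and
# paired with the supervised target as "chosen"; triples go into a replay buffer
# so later updates can reuse them off-policy with any pairwise objective.
import random
from collections import deque

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
policy = AutoModelForCausalLM.from_pretrained("gpt2")
replay_buffer = deque(maxlen=10_000)  # (prompt, chosen, rejected) triples

def self_augment(prompt: str, gold_response: str):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = policy.generate(**inputs, max_new_tokens=64, do_sample=True,
                              top_p=0.9, pad_token_id=tok.eos_token_id)
    sampled = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return prompt, gold_response, sampled

for prompt, gold in [("Explain photosynthesis briefly.",
                      "Plants convert light into chemical energy ...")]:
    replay_buffer.append(self_augment(prompt, gold))

batch = random.sample(list(replay_buffer), k=min(4, len(replay_buffer)))  # off-policy batch
```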
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
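For reference, here is a minimal sketch of the standard DPO objective written as a sum of per-token log-ratios, which is the starting point for a token-level view of preference optimization; TDPO's full objective adds token-level constraints that are not reproduced here.

```python
# Sketch only: the standard DPO loss computed from per-token log-probabilities
# (summed over response tokens). Tensors below are dummy values.
import torch
import torch.nn.functional as F

def dpo_loss_from_token_logps(policy_chosen, policy_rejected,
                              ref_chosen, ref_rejected,
                              mask_chosen, mask_rejected, beta=0.1):
    # Sequence log-ratio = sum of per-token log-ratios over response tokens.
    chosen = ((policy_chosen - ref_chosen) * mask_chosen).sum(-1)
    rejected = ((policy_rejected - ref_rejected) * mask_rejected).sum(-1)
    # Bradley-Terry style preference loss on the difference of log-ratios.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

shape = (2, 5)  # batch of 2, responses of 5 tokens
loss = dpo_loss_from_token_logps(torch.randn(shape), torch.randn(shape),
                                 torch.randn(shape), torch.randn(shape),
                                 torch.ones(shape), torch.ones(shape))
print(loss)
```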
arXiv Detail & Related papers (2024-04-18T08:49:38Z) - Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
A common technique for aligning large language models (LLMs) relies on acquiring human preferences.
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.
We find that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs.
arXiv Detail & Related papers (2024-03-31T02:05:40Z) - Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards [32.799198549439716]
We introduce the Directional Preference Alignment (DPA) framework for aligning large language models (LLMs).
Unlike scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles.
Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity.
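A minimal sketch of that arithmetic control, assuming a reward model that returns a (helpfulness, verbosity) vector and a user-specified unit direction that converts it to a scalar; the reward values below are placeholders.

```python
# Sketch only: a user-chosen direction in a 2-D reward space (helpfulness,
# verbosity) converts a multi-objective reward vector into a scalar reward.
import math

def directional_reward(reward_vector, angle_deg):
    """Project (helpfulness, verbosity) rewards onto a unit preference direction."""
    theta = math.radians(angle_deg)
    direction = (math.cos(theta), math.sin(theta))
    return sum(d * r for d, r in zip(direction, reward_vector))

rewards = [0.8, 0.3]  # placeholder (helpfulness, verbosity) scores from a reward model
print(directional_reward(rewards, angle_deg=0))   # only helpfulness counts
print(directional_reward(rewards, angle_deg=45))  # equal weight on both objectives
```

Sweeping the angle at inference time traces out the helpfulness/verbosity trade-off.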
arXiv Detail & Related papers (2024-02-28T18:58:25Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Getting the most out of your tokenizer for pre-training and domain adaptation [26.427537023771844]
We show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model's generation speed.
We specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.
arXiv Detail & Related papers (2024-02-01T21:49:34Z)
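A minimal sketch of specializing a pre-trained LLM's tokenizer as described in the last entry, using the Hugging Face train_new_from_iterator API; the base tokenizer, corpus, and vocabulary size are placeholder assumptions.

```python
# Sketch only: retrain the same tokenizer algorithm on in-domain text so common
# domain strings become single tokens, shortening sequences at generation time.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base tokenizer

domain_corpus = [
    "def binary_search(arr, target):",
    "for i in range(len(arr)): total += arr[i]",
]  # in practice, stream a large in-domain corpus here

def corpus_iterator(batch_size=1000):
    for i in range(0, len(domain_corpus), batch_size):
        yield domain_corpus[i:i + batch_size]

# Keeps the original pre-tokenization rules but learns a new vocabulary from the domain data.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=32_000)
print(new_tokenizer.tokenize("def binary_search(arr, target):"))
```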