DeAL: Decoding-time Alignment for Large Language Models
- URL: http://arxiv.org/abs/2402.06147v2
- Date: Wed, 21 Feb 2024 02:25:32 GMT
- Title: DeAL: Decoding-time Alignment for Large Language Models
- Authors: James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit
Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth
- Abstract summary: Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences.
We propose DeAL, a framework that lets users customize reward functions and enables Decoding-time Alignment of LLMs.
Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are nowadays expected to generate content
aligned with human preferences. Current work focuses on alignment at model
training time, through techniques such as Reinforcement Learning with Human
Feedback (RLHF). However, it is unclear if such methods are an effective choice
to teach alignment objectives to the model. First, the inability to incorporate
multiple, custom rewards and reliance on a model developer's view of universal
and static principles are key limitations. Second, the residual gaps in model
training and the reliability of such approaches are also questionable (e.g.
susceptibility to jail-breaking even after safety training). To address these,
we propose DeAL, a framework that allows the user to customize reward functions
and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view
decoding as a heuristic-guided search process and facilitate the use of a wide
variety of alignment objectives. Our experiments with programmatic constraints
such as keyword and length constraints (studied widely in the pre-LLM era) and
abstract objectives such as harmlessness and helpfulness (proposed in the
post-LLM era) show that we can DeAL with fine-grained trade-offs, improve
adherence to alignment objectives, and address residual gaps in LLMs. Lastly,
while DeAL can be effectively paired with RLHF and prompting techniques, its
generality makes decoding slower, an optimization we leave for future work.
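At its core, the search view is easy to picture: a beam search in which each partial hypothesis is ranked by the model's log-probability plus a user-supplied alignment reward. The sketch below is a minimal illustration of that idea, not the paper's exact algorithm; `lm_step` and `alignment_reward` are hypothetical callables standing in for the LLM's next-token distribution and the custom alignment heuristic.

```python
from typing import Callable, List, Tuple

# Minimal sketch of decoding as heuristic-guided search: beam search where
# partial hypotheses are ranked by LM log-probability plus a user-supplied
# alignment reward. This is an illustration, not DeAL's exact procedure.

def guided_beam_search(
    lm_step: Callable[[List[int]], List[Tuple[int, float]]],  # seq -> top-k (token, logprob)
    alignment_reward: Callable[[List[int]], float],           # user-defined heuristic
    prompt: List[int],
    beam_width: int = 4,
    max_len: int = 32,
    reward_weight: float = 1.0,
    eos_id: int = 0,
) -> List[int]:
    # Each beam is (cumulative LM log-probability, token sequence).
    beams: List[Tuple[float, List[int]]] = [(0.0, list(prompt))]
    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []
        for logp, seq in beams:
            if len(seq) > len(prompt) and seq[-1] == eos_id:
                candidates.append((logp, seq))  # finished hypotheses carry over
                continue
            for tok, tok_logp in lm_step(seq):
                candidates.append((logp + tok_logp, seq + [tok]))
        # Heuristic-guided ranking: LM score plus the weighted alignment reward.
        candidates.sort(
            key=lambda c: c[0] + reward_weight * alignment_reward(c[1]),
            reverse=True,
        )
        beams = candidates[:beam_width]
        if all(len(s) > len(prompt) and s[-1] == eos_id for _, s in beams):
            break
    return beams[0][1]
```

Programmatic constraints plug in directly: a keyword constraint, for instance, can be an `alignment_reward` that returns 1.0 once the required keyword appears in the hypothesis and 0.0 otherwise.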
Related papers
- Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Adversarial Contrastive Decoding (ACD) is an optimization-based framework to generate two opposite system prompts for prompt-based contrastive decoding.
ACD achieves much better safety performance than previous training-free decoding methods without sacrificing original generation ability.
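ACD's contribution is the optimization of the prompt pair; the contrastive-decoding step itself has a standard shape, sketched below under our own naming and with a fixed contrast weight `alpha` (an assumption, not ACD's procedure):

```python
import numpy as np

# Prompt-based contrastive decoding, schematically: next-token logits under a
# safety-promoting system prompt are pushed away from logits under the
# opposite (adversarial) prompt. Both logit vectors and `alpha` are
# illustrative assumptions.

def contrastive_logits(logits_safe: np.ndarray,
                       logits_opposite: np.ndarray,
                       alpha: float = 0.5) -> np.ndarray:
    return (1.0 + alpha) * logits_safe - alpha * logits_opposite

# Toy usage with random logits over an 8-token vocabulary.
rng = np.random.default_rng(0)
next_token = int(np.argmax(
    contrastive_logits(rng.normal(size=8), rng.normal(size=8))))
```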
arXiv Detail & Related papers (2024-06-24T15:51:30Z)
- Aligning Large Language Models with Representation Editing: A Control Perspective
Fine-tuning large language models (LLMs) to align with human objectives is crucial for real-world applications.
Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model.
We propose aligning LLMs through representation editing.
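As a minimal sketch of test-time representation editing (our construction, not the paper's control-theoretic formulation), a steering vector can be added to one layer's hidden states via a forward hook, leaving the weights untouched:

```python
import torch

# Add a steering vector `v` to the hidden states produced by one transformer
# layer at inference time. `layer` and `v` are hypothetical stand-ins for a
# real model's module and a learned editing direction.

def add_steering_hook(layer: torch.nn.Module, v: torch.Tensor,
                      scale: float = 1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        edited = hidden + scale * v  # broadcasts over batch and sequence dims
        if isinstance(output, tuple):
            return (edited,) + tuple(output[1:])
        return edited
    return layer.register_forward_hook(hook)  # call .remove() to undo the edit
```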
arXiv Detail & Related papers (2024-06-10T01:21:31Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
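In generic notation (ours, not the paper's exact statement), the reduction runs through the Lagrangian dual of a safety-constrained objective; pre-optimizing the dual variable once turns the constrained problem into ordinary unconstrained alignment:

```latex
% Constrained alignment: maximize reward subject to a safety constraint.
\max_{\pi}\ \mathbb{E}_{y\sim\pi}\!\left[r(y)\right]
  \quad\text{s.t.}\quad \mathbb{E}_{y\sim\pi}\!\left[c(y)\right] \ge b .
% Its Lagrangian dual, for \lambda \ge 0:
g(\lambda) \;=\; \max_{\pi}\ \mathbb{E}_{y\sim\pi}\!\left[\,r(y)
  + \lambda\bigl(c(y)-b\bigr)\right].
% Pre-optimizing the smooth, convex dual g once yields \lambda^{\star};
% alignment then reduces to unconstrained alignment with reward
% r + \lambda^{\star} c.
```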
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned from Zephyr-7B-SFT and Llama-3-8B-Instruct, SELM significantly boosts performance on instruction-following benchmarks.
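One generic way to write such an optimistically biased objective (our notation; the paper's bilevel formulation differs in detail) is KL-regularized reward maximization plus an optimism bonus on uncertain responses:

```latex
% Optimism-regularized preference optimization, schematically:
\pi_{t+1} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y\sim\pi}\!\left[\hat{r}_t(y)\right]
  \;+\; \alpha\,\mathbb{E}_{y\sim\pi}\!\left[u_t(y)\right]
  \;-\; \beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right),
% where \hat{r}_t is the current reward estimate, u_t an uncertainty bonus
% favoring under-explored (out-of-distribution) responses, and
% \pi_{\mathrm{ref}} the reference policy.
```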
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- RLSF: Reinforcement Learning via Symbolic Feedback
We propose a new training/fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF).
In RLSF, the LLM that is being trained/fine-tuned is considered as the RL agent, while the environment is allowed access to reasoning tools.
We show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on two different applications.
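The interface is the interesting part: the environment's reward comes from a symbolic tool rather than a scalar preference model. A toy sketch (ours; Python's own parser stands in for the paper's reasoning tools):

```python
import ast

# Toy RLSF-style reward: a symbolic checker scores the RL agent's (the LLM's)
# output. Real symbolic feedback would be finer-grained (e.g. which constraint
# failed); the binary signal here only illustrates the interface shape.

def symbolic_reward(generated_code: str) -> float:
    try:
        ast.parse(generated_code)  # symbolic check: is the output valid Python?
        return 1.0                 # positive feedback to the agent
    except SyntaxError:
        return -1.0                # negative feedback

assert symbolic_reward("x = 1 + 2") == 1.0
assert symbolic_reward("x = = 2") == -1.0
```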
arXiv Detail & Related papers (2024-05-26T18:49:59Z)
- How Can LLM Guide RL? A Value-Based Approach
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
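Written schematically (our notation, not LINVIT's exact update), the LLM enters as a regularizer that pulls the learned policy toward its suggestions while value-based RL supplies the objective:

```latex
% LLM guidance as regularization in value-based RL, schematically:
\pi^{\star} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\!\Bigl[\textstyle\sum_{t} \gamma^{t} r_t\Bigr]
  \;-\; \lambda\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{LLM}}\right),
% so that when the LLM's prior \pi_{\mathrm{LLM}} is informative, far fewer
% environment samples are needed to reach a good policy.
```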
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
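One minimal reading of cross-model guidance (our construction; names and mechanics are assumptions, not InferAligner's published procedure): extract a refusal direction from a safety-aligned guidance model's activations and apply it to the target model only when the query looks harmful:

```python
import torch

# Hypothetical cross-model guidance: a direction computed from a safety-
# aligned model's activations on harmful vs. harmless prompts steers the
# target model's hidden state, gated by a harmfulness signal.

def guidance_direction(acts_harmful: torch.Tensor,
                       acts_harmless: torch.Tensor) -> torch.Tensor:
    # Difference of means over prompts, taken from the *guidance* model.
    return acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)

def guided_hidden(hidden: torch.Tensor, direction: torch.Tensor,
                  is_harmful: bool, strength: float = 4.0) -> torch.Tensor:
    # Leave benign queries untouched, preserving downstream-task performance.
    return hidden + strength * direction if is_harmful else hidden
```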
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
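Schematically (our notation, not RLMEC's exact loss), the generative reward model's per-token rewards weight a token-level policy-gradient objective:

```latex
% Token-level RL objective with per-token rewards \hat{r}_t produced by a
% generative reward model (notation ours):
\mathcal{L}_{\mathrm{RL}}(\theta) \;=\;
  -\,\mathbb{E}_{(x,y)}\Bigl[\textstyle\sum_{t=1}^{T}
  \hat{r}_t \,\log \pi_{\theta}\!\left(y_t \mid x, y_{<t}\right)\Bigr],
% with an imitation term toward the minimally edited reference stabilizing
% training by penalizing drift on tokens the reward model left unchanged.
```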
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- Fundamental Limitations of Alignment in Large Language Models
An important aspect of developing language models that interact with humans is aligning their behavior to be useful and unharmful.
This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones.
We propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models.
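Paraphrased in our notation, the BEB framework scores outputs with a behaviour function B taking values in [-1, 1] and studies its prompt-conditioned expectation:

```latex
% Behavior Expectation Bounds, schematically: the model P's behavior under
% prompt x is the expected behaviour score of its outputs,
B_{P}(x) \;=\; \mathbb{E}_{y \sim P(\cdot \mid x)}\!\left[B(y)\right].
% The paper's core negative result (paraphrased): if an undesired behavior
% retains any non-zero probability mass, there exist prompts x driving
% B_P(x) arbitrarily low, so alignment that merely attenuates a behavior
% can be undone by adversarial prompting.
```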
arXiv Detail & Related papers (2023-04-19T17:50:09Z)