DeAL: Decoding-time Alignment for Large Language Models
- URL: http://arxiv.org/abs/2402.06147v2
- Date: Wed, 21 Feb 2024 02:25:32 GMT
- Title: DeAL: Decoding-time Alignment for Large Language Models
- Authors: James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit
Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth
- Abstract summary: Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences.
We propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs.
Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
- Score: 59.63643988872571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are nowadays expected to generate content
aligned with human preferences. Current work focuses on alignment at model
training time, through techniques such as Reinforcement Learning with Human
Feedback (RLHF). However, it is unclear if such methods are an effective choice
to teach alignment objectives to the model. First, the inability to incorporate
multiple, custom rewards and reliance on a model developer's view of universal
and static principles are key limitations. Second, the residual gaps in model
training and the reliability of such approaches are also questionable (e.g.
susceptibility to jail-breaking even after safety training). To address these,
we propose DeAL, a framework that allows the user to customize reward functions
and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view
decoding as a heuristic-guided search process and facilitate the use of a wide
variety of alignment objectives. Our experiments with programmatic constraints
such as keyword and length constraints (studied widely in the pre-LLM era) and
abstract objectives such as harmlessness and helpfulness (proposed in the
post-LLM era) show that we can DeAL with fine-grained trade-offs, improve
adherence to alignment objectives, and address residual gaps in LLMs. Lastly,
while DeAL can be effectively paired with RLHF and prompting techniques, its
generality makes decoding slower, an optimization we leave for future work.
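The abstract's view of decoding as a heuristic-guided search with user-defined rewards can be illustrated with a small sketch. This is not the paper's implementation: the toy bigram model `TOY_LM`, the helpers `keyword_reward` and `length_reward`, and the weighting scheme in `guided_beam_search` are all hypothetical, and the sketch scores partial hypotheses directly rather than using any lookahead.

```python
import math
from typing import Callable, Dict, List, Tuple

Seq = Tuple[str, ...]
Reward = Callable[[Seq], float]

# Hypothetical toy "language model": next-token probabilities
# depend only on the previous token.
TOY_LM: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.6, "</s>": 0.4},
    "sleeps": {"</s>": 1.0},
}


def next_log_probs(seq: Seq) -> Dict[str, float]:
    """Log-probabilities of each possible next token under the toy model."""
    return {tok: math.log(p) for tok, p in TOY_LM[seq[-1]].items()}


def keyword_reward(keyword: str) -> Reward:
    """Programmatic constraint: 1.0 once the keyword appears, else 0.0."""
    return lambda seq: 1.0 if keyword in seq else 0.0


def length_reward(max_len: int) -> Reward:
    """Programmatic constraint: 1.0 while the content length stays within max_len."""
    def reward(seq: Seq) -> float:
        n = sum(1 for tok in seq if tok not in ("<s>", "</s>"))
        return 1.0 if n <= max_len else 0.0
    return reward


def guided_beam_search(
    rewards: List[Tuple[float, Reward]],
    beam_size: int = 2,
    max_steps: int = 5,
) -> Seq:
    """Beam search ranked by log-probability plus weighted reward terms."""
    def score(seq: Seq, logp: float) -> float:
        return logp + sum(w * r(seq) for w, r in rewards)

    beams: List[Tuple[Seq, float]] = [(("<s>",), 0.0)]
    finished: List[Tuple[Seq, float]] = []
    for _ in range(max_steps):
        # Expand every live hypothesis by one token.
        candidates = [
            (seq + (tok,), logp + lp)
            for seq, logp in beams
            for tok, lp in next_log_probs(seq).items()
        ]
        # The reward acts as the search heuristic when ranking candidates.
        candidates.sort(key=lambda c: score(*c), reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:
            (finished if seq[-1] == "</s>" else beams).append((seq, logp))
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished beams if nothing ended
    return max(pool, key=lambda c: score(*c))[0]
```

Calling `guided_beam_search([])` decodes by model likelihood alone, while passing `[(5.0, keyword_reward("dog"))]` or adding `(5.0, length_reward(2))` steers the search toward outputs that satisfy those constraints. Changing the weights is one simple way to express the kind of trade-off between objectives that the abstract mentions.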
Related papers
- Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF).
Representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of LLM's intermediate hidden states.
It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
arXiv Detail & Related papers (2024-11-04T08:36:03Z)
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
- zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
- Assessing the Zero-Shot Capabilities of LLMs for Action Evaluation in RL [14.091146805312636]
The credit assignment problem is a central challenge in Reinforcement Learning (RL).
Credit Assignment with Language Models (CALM) is a novel approach to automate credit assignment via reward shaping and options discovery.
Preliminary results indicate that the knowledge of Large Language Models is a promising prior for credit assignment in RL.
arXiv Detail & Related papers (2024-09-19T14:08:09Z)
- Aligning Large Language Models with Representation Editing: A Control Perspective [38.71496554018039]
Fine-tuning large language models (LLMs) to align with human objectives is crucial for real-world applications.
Test-time alignment techniques, such as prompting and guided decoding, do not modify the underlying model.
We propose aligning LLMs through representation editing.
arXiv Detail & Related papers (2024-06-10T01:21:31Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.