SoFA: Shielded On-the-fly Alignment via Priority Rule Following
- URL: http://arxiv.org/abs/2402.17358v1
- Date: Tue, 27 Feb 2024 09:52:27 GMT
- Title: SoFA: Shielded On-the-fly Alignment via Priority Rule Following
- Authors: Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei
Han, Yongbin Li
- Abstract summary: This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog.
We present PriorityDistill, a semi-automated approach for distilling priority following signals from simulations to ensure robust rule integration and adherence.
- Score: 90.32819418613407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The alignment problem in Large Language Models (LLMs) involves adapting them
to the broad spectrum of human values. This requirement challenges existing
alignment methods due to diversity of preferences and regulatory standards.
This paper introduces a novel alignment paradigm, priority rule following,
which defines rules as the primary control mechanism in each dialog,
prioritizing them over user instructions. Our preliminary analysis reveals that
even the advanced LLMs, such as GPT-4, exhibit shortcomings in understanding
and prioritizing the rules. Therefore, we present PriorityDistill, a
semi-automated approach for distilling priority following signals from LLM
simulations to ensure robust rule integration and adherence. Our experiments
show that this method not only effectively minimizes misalignments utilizing
only one general rule but also adapts smoothly to various unseen rules,
ensuring they are shielded from hijacking and that the model responds
appropriately.
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion [44.95386817008473]
We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data.
We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient.
We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task.
arXiv Detail & Related papers (2024-06-27T14:03:49Z) - Controllable Preference Optimization: Toward Controllable
Multi-Objective Alignment [107.63756895544842]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - On Finding Bi-objective Pareto-optimal Fraud Prevention Rule Sets for Fintech Applications [0.949959422062959]
Rules are widely used in institutions to make fraud prevention decisions.
This paper focuses on improving the flexibility and efficacy of a two-stage framework for rule mining.
arXiv Detail & Related papers (2023-11-02T03:18:40Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - On Regularization and Inference with Label Constraints [62.60903248392479]
We compare two strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference.
For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints.
For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage.
arXiv Detail & Related papers (2023-07-08T03:39:22Z) - Distilling Task-specific Logical Rules from Large Pre-trained Models [24.66436804853525]
We develop a novel framework to distill task-specific logical rules from large pre-trained models.
Specifically, we borrow recent prompt-based language models as the knowledge expert to yield initial seed rules.
Experiments on three public named entity tagging benchmarks demonstrate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2022-10-06T09:12:18Z) - A Distributional Approach to Controlled Text Generation [3.279201607581627]
We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs)
This view permits to define, in a single formal framework, "pointwise" and "distributional" constraints over the target LM.
We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
arXiv Detail & Related papers (2020-12-21T19:02:41Z) - Diverse Rule Sets [20.170305081348328]
Rule-based systems are experiencing a renaissance owing to their intuitive if-then representation.
We propose a novel approach of inferring diverse rule sets, by optimizing small overlap among decision rules.
We then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap.
arXiv Detail & Related papers (2020-06-17T14:15:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.