SoFA: Shielded On-the-fly Alignment via Priority Rule Following
- URL: http://arxiv.org/abs/2402.17358v1
- Date: Tue, 27 Feb 2024 09:52:27 GMT
- Title: SoFA: Shielded On-the-fly Alignment via Priority Rule Following
- Authors: Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei
Han, Yongbin Li
- Abstract summary: This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog.
We present PriorityDistill, a semi-automated approach for distilling priority following signals from simulations to ensure robust rule integration and adherence.
- Score: 90.32819418613407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The alignment problem in Large Language Models (LLMs) involves adapting them
to the broad spectrum of human values. This requirement challenges existing
alignment methods due to diversity of preferences and regulatory standards.
This paper introduces a novel alignment paradigm, priority rule following,
which defines rules as the primary control mechanism in each dialog,
prioritizing them over user instructions. Our preliminary analysis reveals that
even the advanced LLMs, such as GPT-4, exhibit shortcomings in understanding
and prioritizing the rules. Therefore, we present PriorityDistill, a
semi-automated approach for distilling priority following signals from LLM
simulations to ensure robust rule integration and adherence. Our experiments
show that this method not only effectively minimizes misalignments utilizing
only one general rule but also adapts smoothly to various unseen rules,
ensuring they are shielded from hijacking and that the model responds
appropriately.
Related papers
- Neuro-Symbolic Rule Lists [31.085257698392354]
NeuRules is an end-to-end trainable model that unifies discretization, rule learning, and rule order into a single framework.
We show that NeuRules consistently outperforms neuro-symbolic methods, effectively learning simple and complex rules, as well as their order, across a wide range of datasets.
arXiv Detail & Related papers (2024-11-10T11:10:36Z) - A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers [20.416696003269674]
This paper introduces a scalable visual analysis method to explain tree ensemble classifiers that contain tens of thousands of rules.
We develop an anomaly-biased model reduction method to prioritize these rules at each hierarchical level.
Our method fosters a deeper understanding of both common and anomalous rules, thereby enhancing interpretability without sacrificing comprehensiveness.
arXiv Detail & Related papers (2024-09-05T01:48:11Z) - ABC Align: Large Language Model Alignment for Safety & Accuracy [0.0]
We present ABC Align, a novel alignment methodology for Large Language Models (LLMs)
We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation.
Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
arXiv Detail & Related papers (2024-08-01T06:06:25Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - On Regularization and Inference with Label Constraints [62.60903248392479]
We compare two strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference.
For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints.
For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage.
arXiv Detail & Related papers (2023-07-08T03:39:22Z) - Distilling Task-specific Logical Rules from Large Pre-trained Models [24.66436804853525]
We develop a novel framework to distill task-specific logical rules from large pre-trained models.
Specifically, we borrow recent prompt-based language models as the knowledge expert to yield initial seed rules.
Experiments on three public named entity tagging benchmarks demonstrate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2022-10-06T09:12:18Z) - A Distributional Approach to Controlled Text Generation [3.279201607581627]
We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs)
This view permits to define, in a single formal framework, "pointwise" and "distributional" constraints over the target LM.
We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
arXiv Detail & Related papers (2020-12-21T19:02:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.