Related papers: SoFA: Shielded On-the-fly Alignment via Priority Rule Following

SoFA: Shielded On-the-fly Alignment via Priority Rule Following

URL: http://arxiv.org/abs/2402.17358v1
Date: Tue, 27 Feb 2024 09:52:27 GMT
Title: SoFA: Shielded On-the-fly Alignment via Priority Rule Following
Authors: Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li
Abstract summary: This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog. We present PriorityDistill, a semi-automated approach for distilling priority following signals from simulations to ensure robust rule integration and adherence.
Score: 90.32819418613407
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The alignment problem in Large Language Models (LLMs) involves adapting them to the broad spectrum of human values. This requirement challenges existing alignment methods due to diversity of preferences and regulatory standards. This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. Our preliminary analysis reveals that even the advanced LLMs, such as GPT-4, exhibit shortcomings in understanding and prioritizing the rules. Therefore, we present PriorityDistill, a semi-automated approach for distilling priority following signals from LLM simulations to ensure robust rule integration and adherence. Our experiments show that this method not only effectively minimizes misalignments utilizing only one general rule but also adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately.

Related papers

Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies.<n>Existing fine-tuning methods typically condition predictions on pre-defined policies.<n>We propose to condition model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
arXiv Detail & Related papers (2025-07-28T03:45:34Z)
RuleGenie: SIEM Detection Rule Set Optimization [24.312198733476063]
Redundant or overlapping rules in SIEM systems lead to excessive false alerts, degrading analyst performance due to alert fatigue, and increase computational overhead.<n>We present RuleGenie, a novel large language model (LLM) aided recommender system designed to optimize SIEM rule sets.
arXiv Detail & Related papers (2025-05-10T16:56:17Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
Neuro-Symbolic Rule Lists [31.085257698392354]
NeuRules is an end-to-end trainable model that unifies discretization, rule learning, and rule order into a single framework. We show that NeuRules consistently outperforms neuro-symbolic methods, effectively learning simple and complex rules, as well as their order, across a wide range of datasets.
arXiv Detail & Related papers (2024-11-10T11:10:36Z)
A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers [20.416696003269674]
This paper introduces a scalable visual analysis method to explain tree ensemble classifiers that contain tens of thousands of rules. We develop an anomaly-biased model reduction method to prioritize these rules at each hierarchical level. Our method fosters a deeper understanding of both common and anomalous rules, thereby enhancing interpretability without sacrificing comprehensiveness.
arXiv Detail & Related papers (2024-09-05T01:48:11Z)
ABC Align: Large Language Model Alignment for Safety & Accuracy [0.0]
We present ABC Align, a novel alignment methodology for Large Language Models (LLMs) We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation. Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
arXiv Detail & Related papers (2024-08-01T06:06:25Z)
Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions. We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step. Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision. We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
On Regularization and Inference with Label Constraints [62.60903248392479]
We compare two strategies for encoding label constraints in a machine learning pipeline, regularization with constraints and constrained inference. For regularization, we show that it narrows the generalization gap by precluding models that are inconsistent with the constraints. For constrained inference, we show that it reduces the population risk by correcting a model's violation, and hence turns the violation into an advantage.
arXiv Detail & Related papers (2023-07-08T03:39:22Z)
Distilling Task-specific Logical Rules from Large Pre-trained Models [24.66436804853525]
We develop a novel framework to distill task-specific logical rules from large pre-trained models. Specifically, we borrow recent prompt-based language models as the knowledge expert to yield initial seed rules. Experiments on three public named entity tagging benchmarks demonstrate the effectiveness of our proposed framework.
arXiv Detail & Related papers (2022-10-06T09:12:18Z)
A Distributional Approach to Controlled Text Generation [3.279201607581627]
We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs) This view permits to define, in a single formal framework, "pointwise" and "distributional" constraints over the target LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
arXiv Detail & Related papers (2020-12-21T19:02:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.