ABC Align: Large Language Model Alignment for Safety & Accuracy
- URL: http://arxiv.org/abs/2408.00307v1
- Date: Thu, 1 Aug 2024 06:06:25 GMT
- Title: ABC Align: Large Language Model Alignment for Safety & Accuracy
- Authors: Gareth Seneque, Lap-Hang Ho, Ariel Kuperman, Nafise Erfanian Saeedi, Jeffrey Molendijk,
- Abstract summary: We present ABC Align, a novel alignment methodology for Large Language Models (LLMs).
We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation.
Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment of Large Language Models (LLMs) remains an unsolved problem. Human preferences are highly distributed and can be captured at multiple levels of abstraction, from the individual to diverse populations. Organisational preferences, represented by standards and principles, are defined to mitigate reputational risk or meet legislative obligations. In this paper, we present ABC Align, a novel alignment methodology for LLMs that enables integration of the standards and preferences of a large media organisation into the LLM itself. We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation. Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
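The abstract does not detail the preference-optimisation step; as a rough, hedged illustration of the family of methods it builds on, a minimal DPO-style pairwise preference loss (the names, shapes, and the choice of DPO itself are assumptions, not the paper's implementation) could look like this:

```python
import torch.nn.functional as F

def pairwise_preference_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO-style pairwise preference loss (illustrative sketch only).

    Each argument is a 1-D tensor of summed token log-probabilities for a batch
    of (chosen, rejected) response pairs under the policy or a frozen reference
    model; `beta` scales the implicit reward.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```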
Related papers
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from vast text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z) - LLM-CI: Assessing Contextual Integrity Norms in Language Models [1.1715858161748576]
Large language models (LLMs) may inadvertently encode societal preferences and norms.
This is especially challenging due to prompt sensitivity: small variations in prompts yield different responses.
We present LLM-CI, the first open-sourced framework to assess encoded norms.
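Since prompt sensitivity is flagged as the central measurement difficulty, a minimal sketch of a multi-prompt consistency check conveys the idea; `query_model` is a placeholder callable, not LLM-CI's actual API or protocol:

```python
from collections import Counter

def consistency_score(prompt_variants, query_model):
    """Query a model with paraphrased variants of the same prompt and report
    how often the most common (normalised) answer appears.
    `query_model` is a hypothetical callable: prompt -> answer string.
    """
    answers = [query_model(p) for p in prompt_variants]
    top_answer, count = Counter(answers).most_common(1)[0]
    return count / len(answers), top_answer
```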
arXiv Detail & Related papers (2024-09-05T17:50:31Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment [8.028743532294532]
We introduce PREMIA, a novel reference-based attack framework specifically designed for analyzing preference data.
We provide empirical evidence that DPO models are more vulnerable to MIA compared to PPO models.
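As a hedged illustration of the general reference-based membership-inference idea (not PREMIA's published statistic), a membership score can compare how well the target model and an independent reference model fit a candidate preference example:

```python
def reference_mia_score(target_logprob: float, reference_logprob: float) -> float:
    """Generic reference-based membership score: examples the target model fits
    much better than a reference model are more likely to have appeared in its
    (preference) training data. Illustrative only."""
    return target_logprob - reference_logprob

def predict_member(target_logprob: float, reference_logprob: float,
                   threshold: float = 0.0) -> bool:
    # In practice the threshold would be calibrated on known non-member data.
    return reference_mia_score(target_logprob, reference_logprob) > threshold
```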
arXiv Detail & Related papers (2024-07-08T22:53:23Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
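A minimal sketch of the mechanism described above, with the mapping from uncertainty to smoothing strength assumed rather than taken from the paper:

```python
import torch.nn.functional as F

def uncertainty_smoothed_ce(logits, targets, uncertainty, max_eps=0.2):
    """Cross-entropy with a per-sample label-smoothing value that grows with
    that sample's uncertainty (assumed to lie in [0, 1]). Sketch of the idea
    in the UAL summary; the paper's exact schedule is not reproduced here.

    logits: (batch, n_classes), targets: (batch,), uncertainty: (batch,)
    """
    eps = max_eps * uncertainty
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # loss against a uniform target
    return ((1.0 - eps) * nll + eps * uniform).mean()
```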
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is to leverage the human prior knowledge contained in the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low-quality examples within the generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
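The multi-reference idea can be sketched as a small extension of the pairwise loss shown earlier, here aggregating the reference models by simple averaging of their log-probabilities; this averaging is an assumption, not MRPO's published closed-form rule:

```python
import torch
import torch.nn.functional as F

def multi_reference_preference_loss(policy_chosen, policy_rejected,
                                    ref_chosen_list, ref_rejected_list, beta=0.1):
    """DPO-style pairwise loss whose reference term averages the summed
    log-probabilities of several reference models (lists of 1-D tensors).
    Illustrative sketch only."""
    ref_chosen = torch.stack(ref_chosen_list).mean(dim=0)
    ref_rejected = torch.stack(ref_rejected_list).mean(dim=0)
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```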
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration [26.467379188463028]
We propose a novel Value-based CaliBration (VCB) method to better align Large Language Models with human preferences.
Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets.
arXiv Detail & Related papers (2024-02-25T08:45:10Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - Large Language Model (LLM) Bias Index -- LLMBI [0.0]
The Large Language Model Bias Index (LLMBI) is a pioneering approach designed to quantify and address biases inherent in large language models (LLMs).
We formulated LLMBI using a composite scoring system incorporating multiple dimensions of bias, including but not limited to age, gender, and racial biases.
Our empirical analysis, conducted using responses from OpenAI's API, employs advanced sentiment analysis as a representative method for bias detection.
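A hedged sketch of what a composite bias index can look like, with placeholder dimensions and equal default weights rather than the published LLMBI formula:

```python
def composite_bias_index(dimension_scores, weights=None):
    """Combine per-dimension bias scores (e.g. age, gender, race), each assumed
    to lie in [0, 1], into one weighted index. Placeholder formula; the
    published LLMBI defines its own scoring system."""
    if weights is None:
        weights = {dim: 1.0 for dim in dimension_scores}
    total_weight = sum(weights[dim] for dim in dimension_scores)
    return sum(weights[dim] * s for dim, s in dimension_scores.items()) / total_weight

# Illustrative numbers only:
# composite_bias_index({"age": 0.20, "gender": 0.35, "race": 0.15})  # -> ~0.233
```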
arXiv Detail & Related papers (2023-12-22T15:38:13Z) - Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback [11.895749982167375]
Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years.
This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs.
Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user.
arXiv Detail & Related papers (2023-03-09T17:52:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.