Personalisation within bounds: A risk taxonomy and policy framework for
the alignment of large language models with personalised feedback
- URL: http://arxiv.org/abs/2303.05453v1
- Date: Thu, 9 Mar 2023 17:52:07 GMT
- Title: Personalisation within bounds: A risk taxonomy and policy framework for
the alignment of large language models with personalised feedback
- Authors: Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
- Abstract summary: Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years.
This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs.
Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user.
- Score: 11.895749982167375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are used to generate content for a wide range of
tasks, and are set to reach a growing audience in coming years due to
integration in product interfaces like ChatGPT or search engines like Bing.
This intensifies the need to ensure that models are aligned with human
preferences and do not produce unsafe, inaccurate or toxic outputs. While
alignment techniques like reinforcement learning with human feedback (RLHF) and
red-teaming can mitigate some safety concerns and improve model capabilities,
it is unlikely that an aggregate fine-tuning process can adequately represent
the full range of users' preferences and values. Different people may
legitimately disagree on their preferences for language and conversational
norms, as well as on values or ideologies which guide their communication.
Personalising LLMs through micro-level preference learning processes may result
in models that are better aligned with each user. However, there are several
normative challenges in defining the bounds of a societally-acceptable and safe
degree of personalisation. In this paper, we ask how, and in what ways, LLMs
should be personalised. First, we review literature on current paradigms for
aligning LLMs with human feedback, and identify issues including (i) a lack of
clarity regarding what alignment means; (ii) a tendency of technology providers
to prescribe definitions of inherently subjective preferences and values; and
(iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in
who we are really aligning to. Second, we present a taxonomy of benefits and
risks associated with personalised LLMs, for individuals and society at large.
Finally, we propose a three-tiered policy framework that allows users to
experience the benefits of personalised alignment, while restraining unsafe and
undesirable LLM-behaviours within (supra-)national and organisational bounds.
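To make the three-tiered framework concrete, the sketch below shows one way such bounds could be enforced in code: a personalisation request is checked against immutable (supra-)national restrictions first, then against organisational policy, and only the preferences that survive both are applied at the user tier. This is an illustration under assumed tier contents, not the authors' implementation; the category names and the PersonalisationRequest structure are invented.

```python
from dataclasses import dataclass, field

# Hypothetical tier contents; the paper defines the tiers, not these categories.
TIER1_PROHIBITED = {"incitement_to_violence", "illegal_content"}   # (supra-)national bound, immutable
TIER2_RESTRICTED = {"medical_advice", "political_persuasion"}      # organisational bound, may only tighten

@dataclass
class PersonalisationRequest:
    user_id: str
    preferences: dict[str, str] = field(default_factory=dict)  # category -> desired behaviour

def resolve_personalisation(req: PersonalisationRequest) -> dict[str, str]:
    """Keep only the user preferences permitted by both upper tiers."""
    resolved = {}
    for category, behaviour in req.preferences.items():
        if category in TIER1_PROHIBITED:
            raise ValueError(f"'{category}' is prohibited at the (supra-)national tier.")
        if category in TIER2_RESTRICTED:
            continue  # dropped by organisational policy in this sketch
        resolved[category] = behaviour  # user tier: personalisation within bounds
    return resolved

# Example: tone personalisation passes; a restricted category is filtered out.
req = PersonalisationRequest("user-42", {"tone": "informal", "medical_advice": "always give dosages"})
print(resolve_personalisation(req))  # {'tone': 'informal'}
```

The design choice the sketch tries to mirror is that lower tiers can only operate inside whatever the upper tiers leave open, never widen it.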
Related papers
- Personalization of Large Language Models: A Survey [131.00650432814268]
Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications.
Most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems.
We introduce a taxonomy for personalized LLM usage and summarize the key differences and challenges.
arXiv Detail & Related papers (2024-10-29T04:01:11Z)
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from large text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
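The summary above leaves the mechanism unspecified; as a generic, hedged illustration of stating preferences at inference time (not MetaAlign's actual method), a preference description can simply be prepended to the prompt so one model serves different users without retraining. The function name and preference strings are invented.

```python
def build_preference_prompt(user_preferences: list[str], task_prompt: str) -> str:
    """Illustrative only: condition a response on preferences supplied at inference time."""
    preference_block = "\n".join(f"- {p}" for p in user_preferences)
    return (
        "Follow these user preferences when responding:\n"
        f"{preference_block}\n\n"
        f"Task: {task_prompt}"
    )

prompt = build_preference_prompt(
    ["Answer concisely", "Avoid technical jargon"],  # explicit preferences
    "Explain what reinforcement learning from human feedback is.",
)
```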
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
- LLM-CI: Assessing Contextual Integrity Norms in Language Models [1.1715858161748576]
Large language models (LLMs) may inadvertently encode societal preferences and norms.
This is especially challenging due to prompt sensitivity: small variations in prompts yield different responses.
We present LLM-CI, the first open-sourced framework to assess encoded norms.
arXiv Detail & Related papers (2024-09-05T17:50:31Z)
- ABC Align: Large Language Model Alignment for Safety & Accuracy [0.0]
We present ABC Align, a novel alignment methodology for Large Language Models (LLMs).
We combine a set of data and methods that build on recent breakthroughs in synthetic data generation, preference optimisation, and post-training model quantisation.
Our unified approach mitigates bias and improves accuracy, while preserving reasoning capability, as measured against standard benchmarks.
arXiv Detail & Related papers (2024-08-01T06:06:25Z)
- Exploring Safety-Utility Trade-Offs in Personalized Language Models [26.792174008353008]
We show that large language models (LLMs) suffer from personalization bias, where their performance is impacted when they are personalized to a user's identity.
We quantify personalization bias by evaluating the performance of LLMs along two axes - safety and utility.
We discuss several strategies to mitigate personalization bias using preference tuning and prompt-based defenses.
arXiv Detail & Related papers (2024-06-17T00:17:11Z)
- Beyond Human Norms: Unveiling Unique Values of Large Language Models through Interdisciplinary Approaches [69.73783026870998]
This work proposes a novel framework, ValueLex, to reconstruct Large Language Models' unique value system from scratch.
Based on the Lexical Hypothesis, ValueLex introduces a generative approach to elicit diverse values from 30+ LLMs.
We identify three core value dimensions, Competence, Character, and Integrity, each with specific subdimensions, revealing that LLMs possess a structured, albeit non-human, value system.
arXiv Detail & Related papers (2024-04-19T09:44:51Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- DeAL: Decoding-time Alignment for Large Language Models [59.63643988872571]
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences.
We propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs.
Our experiments show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs.
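As a hedged sketch of decoding-time alignment in general, rather than the DeAL framework itself, a user-supplied reward function can rerank sampled continuations at generation time; generate_candidates, the toy reward, and all names below are placeholders.

```python
from typing import Callable

def decode_with_reward(
    generate_candidates: Callable[[str, int], list[str]],  # placeholder for any LLM sampler
    reward_fn: Callable[[str], float],                      # user-customised alignment objective
    prompt: str,
    n_candidates: int = 8,
) -> str:
    """Illustrative decoding-time alignment: pick the sampled continuation with the highest reward."""
    candidates = generate_candidates(prompt, n_candidates)
    return max(candidates, key=reward_fn)

# Toy demonstration with a fake sampler; a real system would sample from an LLM here.
fake_sampler = lambda p, n: [f"{p} ... candidate {i}" + (" unsafe" if i == 0 else "") for i in range(n)]
toy_reward = lambda text: -len(text) - 100.0 * ("unsafe" in text.lower())
print(decode_with_reward(fake_sampler, toy_reward, "Summarise the report", 4))
```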
arXiv Detail & Related papers (2024-02-05T06:12:29Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.