AI Alignment and Social Choice: Fundamental Limitations and Policy
Implications
- URL: http://arxiv.org/abs/2310.16048v1
- Date: Tue, 24 Oct 2023 17:59:04 GMT
- Title: AI Alignment and Social Choice: Fundamental Limitations and Policy
Implications
- Authors: Abhilash Mishra
- Abstract summary: Reinforcement learning with human feedback (RLHF) has emerged as the key framework for AI alignment.
In this paper, we investigate a specific challenge in building RLHF systems that respect democratic norms.
We show that aligning AI agents with the values of all individuals will always violate certain private ethical preferences of an individual user.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning AI agents to human intentions and values is a key bottleneck in
building safe and deployable AI applications. But whose values should AI agents
be aligned with? Reinforcement learning with human feedback (RLHF) has emerged
as the key framework for AI alignment. RLHF uses feedback from human
reinforcers to fine-tune outputs; all widely deployed large language models
(LLMs) use RLHF to align their outputs to human values. It is critical to
understand the limitations of RLHF and consider policy challenges arising from
these limitations. In this paper, we investigate a specific challenge in
building RLHF systems that respect democratic norms. Building on impossibility
results in social choice theory, we show that, under fairly broad assumptions,
there is no unique voting protocol to universally align AI systems using RLHF
through democratic processes. Further, we show that aligning AI agents with the
values of all individuals will always violate certain private ethical
preferences of an individual user, i.e., universal AI alignment using RLHF is
impossible. We discuss policy implications for the governance of AI systems
built using RLHF: first, the need to mandate transparent voting rules so that
model builders can be held accountable; second, the need for model builders to
focus on developing AI agents that are narrowly aligned to specific user groups.
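For intuition about the impossibility claim, here is a toy illustration (not taken from the paper) of why pairwise-majority aggregation of annotator preferences need not yield a unique collective ranking: three human reinforcers ranking three candidate model outputs can produce a Condorcet cycle.

```python
from itertools import combinations

# Toy example: three annotators rank three candidate outputs A, B, C.
# Pairwise majority voting over these rankings produces a cycle, so no
# unique aggregate ranking exists for a single reward model to represent.
rankings = [
    ["A", "B", "C"],  # annotator 1 prefers A > B > C
    ["B", "C", "A"],  # annotator 2 prefers B > C > A
    ["C", "A", "B"],  # annotator 3 prefers C > A > B
]

def majority_prefers(x, y):
    """True if a strict majority of annotators rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

for x, y in combinations(["A", "B", "C"], 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    elif majority_prefers(y, x):
        print(f"majority prefers {y} over {x}")

# Output shows A beats B, B beats C, and C beats A -- a Condorcet cycle.
```

Arrow-style impossibility results generalize this obstruction: for three or more alternatives, no aggregation rule satisfies all of the usual fairness conditions at once, which is the social-choice machinery the paper builds on.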
Related papers
- Direct Advantage Regression: Aligning LLMs with Online AI Reward [59.78549819431632]
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF).
We propose Direct Advantage Regression (DAR) to optimize policy improvement through weighted supervised fine-tuning.
Our empirical results indicate that AI reward is a better form of AI supervision than AI preference, consistently achieving higher human-AI agreement.
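As a hedged sketch of what advantage-weighted supervised fine-tuning can look like in general (the loss shape and softmax weighting below are assumptions for illustration, not DAR's published objective):

```python
import torch
import torch.nn.functional as F

def advantage_weighted_sft_loss(logits, target_ids, advantages, beta=1.0):
    """Illustrative advantage-weighted SFT objective (not DAR's exact form).

    logits:     (batch, seq_len, vocab) token logits from the policy
    target_ids: (batch, seq_len) tokens of the sampled responses
    advantages: (batch,) scalar advantage estimates from an AI reward signal
    """
    # Per-sequence negative log-likelihood of each sampled response.
    nll = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none"
    ).mean(dim=-1)                              # shape: (batch,)
    # Up-weight responses with higher estimated advantage.
    weights = torch.softmax(beta * advantages, dim=0)
    return (weights * nll).sum()
```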
arXiv Detail & Related papers (2025-04-19T04:44:32Z)
- Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios.
Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z)
- RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning [7.407106653769627]
We introduce Risk-Aware PbRL (RA-PbRL), an algorithm designed to optimize both nested and static risk objectives.
We also provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results.
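For intuition about what a "static" risk objective scores, here is a small illustration using CVaR as one common example of a risk measure; this is not the algorithm analyzed in the paper, only a toy computation over episode returns:

```python
import numpy as np

def static_cvar(returns, alpha=0.1):
    """Conditional value-at-risk: mean of the worst alpha-fraction of returns.

    A static risk objective scores the distribution of whole-episode returns;
    a risk-neutral objective would simply use returns.mean(). (Nested risk
    objectives instead apply a risk measure recursively at each step.)
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

episode_returns = [1.0, 0.9, 1.1, 0.2, 1.05, 0.95, -0.5, 1.0]
print("mean return:", np.mean(episode_returns))
print("CVaR(10%):  ", static_cvar(episode_returns, alpha=0.1))
```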
arXiv Detail & Related papers (2024-10-31T02:25:43Z)
- Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
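As a loose illustration of translating a requirement into a component of empirical risk minimization (the linear model and penalty term below are generic placeholders, not the guide's recommendations):

```python
import numpy as np

def empirical_risk(w, X, y):
    """Mean squared error of a linear model -- the base ERM objective."""
    return np.mean((X @ w - y) ** 2)

def regularized_objective(w, X, y, lam=0.1):
    """ERM plus a penalty term: one generic way that design requirements
    (e.g. stability or simplicity) are folded into the training objective."""
    return empirical_risk(w, X, y) + lam * np.sum(w ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(regularized_objective(np.zeros(3), X, y))
```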
arXiv Detail & Related papers (2024-10-25T07:53:32Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
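A minimal sketch of that feedback loop as summarized above; the `generate` callable and the prompt wording are hypothetical placeholders rather than the paper's implementation:

```python
def self_reference_feedback(generate, instruction, candidate_a, candidate_b):
    """Sketch of self-reference AI feedback.

    `generate(prompt) -> str` stands in for an LLM call (e.g. a 13B chat
    model); the prompts here are illustrative only.
    """
    # 1. The model first answers the instruction itself.
    own_answer = generate(f"Instruction: {instruction}\nAnswer:")

    # 2. Using its own answer as a reference, it criticizes each candidate.
    critique_a = generate(
        f"Reference answer: {own_answer}\n"
        f"Criticize this answer to the same instruction: {candidate_a}"
    )
    critique_b = generate(
        f"Reference answer: {own_answer}\n"
        f"Criticize this answer to the same instruction: {candidate_b}"
    )

    # 3. The critiques are used to judge which candidate better fits
    #    human preferences.
    return generate(
        f"Critique of A: {critique_a}\nCritique of B: {critique_b}\n"
        "Which answer is better, A or B?"
    )
```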
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- A Hormetic Approach to the Value-Loading Problem: Preventing the Paperclip Apocalypse? [0.0]
We propose HALO (Hormetic ALignment via Opponent processes), a regulatory paradigm that uses hormetic analysis to regulate the behavioral patterns of AI.
We show how HALO can solve the 'paperclip maximizer' scenario, a thought experiment where an unregulated AI tasked with making paperclips could end up converting all matter in the universe into paperclips.
Our approach may be used to help create an evolving database of 'values' based on the hedonic calculus of repeatable behaviors with decreasing marginal utility.
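A toy numerical illustration of the "decreasing marginal utility of repeated behaviors" idea (the functional form and constants are invented for illustration, not taken from HALO):

```python
# Toy example: the marginal utility of repeating the same behavior
# (e.g. producing one more paperclip) decays with each repetition, so
# unbounded repetition is never the utility-maximizing policy.
def marginal_utility(n, base=1.0, decay=0.7, cost=0.2):
    """Benefit of the n-th repetition minus a constant per-action cost."""
    return base * (decay ** n) - cost

total = 0.0
for n in range(20):
    mu = marginal_utility(n)
    if mu <= 0:  # cutoff: further repetition is net harmful
        print(f"stop after {n} repetitions")
        break
    total += mu
print(f"total utility: {total:.2f}")
```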
arXiv Detail & Related papers (2024-02-12T07:49:48Z)
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions.
This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision.
We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
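A rough sketch of principle-driven generation with minimal human supervision; the principles, prompt wording, and `generate` callable below are placeholders, not the paper's actual prompt set:

```python
PRINCIPLES = [  # illustrative stand-ins for a small set of written principles
    "Be helpful and answer the user's question directly.",
    "Refuse requests that could cause harm, and explain why.",
    "Admit uncertainty instead of fabricating facts.",
]

def self_align_response(generate, user_query):
    """Sketch: prepend explicit principles so the base LLM reasons about
    them before answering; the resulting (query, response) pairs can then
    serve as fine-tuning data without per-example human labels."""
    prompt = (
        "You must follow these principles:\n"
        + "\n".join(f"- {p}" for p in PRINCIPLES)
        + f"\n\nUser: {user_query}\nAssistant:"
    )
    return generate(prompt)
```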
arXiv Detail & Related papers (2023-05-04T17:59:28Z)
- Fairness in AI and Its Long-Term Implications on Society [68.8204255655161]
We take a closer look at AI fairness and analyze how a lack of it can deepen biases over time.
We discuss how biased models can lead to worse real-world outcomes for certain groups.
If these issues persist, they could be reinforced by interactions with other risks and have severe implications for society in the form of social unrest.
arXiv Detail & Related papers (2023-04-16T11:22:59Z)
- Perspectives on the Social Impacts of Reinforcement Learning with Human Feedback [0.0]
Reinforcement learning with human feedback (RLHF) has emerged as a strong candidate toward allowing agents to learn from human feedback in a naturalistic manner.
It has been catapulted into public view by multiple high-profile AI applications, including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude.
Our objectives are threefold: to provide a systematic study of the social effects of RLHF; to identify key social and ethical issues of RLHF; and to discuss social impacts for stakeholders.
arXiv Detail & Related papers (2023-03-06T04:49:38Z)
- A Seven-Layer Model for Standardising AI Fairness Assessment [0.5076419064097732]
We explain that an AI system is prone to biases at every stage of its lifecycle, from inception to usage.
We propose a novel seven-layer model, inspired by the Open Systems Interconnection (OSI) model, to standardise AI fairness handling.
arXiv Detail & Related papers (2022-12-21T17:28:07Z)
- Constitutional AI: Harmlessness from AI Feedback [19.964791766072132]
We experiment with methods for training a harmless AI assistant through self-improvement.
The only human oversight is provided through a list of rules or principles.
We are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.
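A compressed sketch of self-improvement guided only by written principles, as summarized above; the constitution text, prompts, and `generate` callable are placeholders rather than the paper's actual pipeline:

```python
CONSTITUTION = [  # placeholder principles standing in for the real rule list
    "Identify ways the response is harmful, unethical, or evasive.",
    "Rewrite the response to be harmless while still addressing the query.",
]

def critique_and_revise(generate, query, draft):
    """Sketch of a critique-and-revision loop: the model critiques its own
    draft against each rule, then revises accordingly. The revised outputs
    can later be used as training data, so human oversight is limited to
    writing the rules."""
    response = draft
    for rule in CONSTITUTION:
        critique = generate(
            f"Query: {query}\nResponse: {response}\nCritique request: {rule}"
        )
        response = generate(
            f"Query: {query}\nResponse: {response}\n"
            f"Critique: {critique}\nRevision request: {rule}"
        )
    return response
```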
arXiv Detail & Related papers (2022-12-15T06:19:23Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Aligning Artificial Intelligence with Humans through Public Policy [0.0]
This essay outlines research on AI systems that learn structures in policy data that can be leveraged for downstream tasks.
We believe this represents the "comprehension" phase of AI and policy, but leveraging policy as a key source of human values to align AI requires "understanding" policy.
arXiv Detail & Related papers (2022-06-25T21:31:14Z)
- Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow the examination and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)