A Decision-Theoretic Approach for Managing Misalignment
- URL: http://arxiv.org/abs/2512.15584v2
- Date: Sun, 21 Dec 2025 22:26:06 GMT
- Title: A Decision-Theoretic Approach for Managing Misalignment
- Authors: Daniel A. Herrmann, Abinav Chari, Isabelle Qian, Sree Sharvesh, B. A. Levinstein
- Abstract summary: We argue that rational delegation requires balancing an agent's value with its accuracy and its reach. We show that context-specific delegation can be optimal even with significant misalignment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent's value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff, precisely accounting for a principal's uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent's superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.
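The core claim above — that delegation can be rational in expectation despite misalignment, when the agent has better accuracy or wider reach — can be illustrated with a minimal Monte Carlo sketch. The payoff tables, the accuracy model, and all names below are illustrative assumptions, not the paper's actual formalism or scoring framework: `u` is the principal's utility, `v` a partially misaligned agent utility, and the agent both observes the state more accurately and has one extra act.

```python
import random

random.seed(0)

STATES = [0, 1, 2]        # equally likely world states
ACTS_PRINCIPAL = [0, 1]   # the principal's own reach: two available acts
ACTS_AGENT = [0, 1, 2]    # the agent's expanded reach: a third act is available

# Hypothetical payoff tables (not from the paper): u[state][act] is the
# principal's utility, v the agent's partially misaligned utility.
u = {0: {0: 1.0, 1: 0.0, 2: 0.6},
     1: {0: 0.0, 1: 1.0, 2: 0.6},
     2: {0: 0.2, 1: 0.2, 2: 1.0}}
v = {0: {0: 0.9, 1: 0.1, 2: 0.7},
     1: {0: 0.1, 1: 0.9, 2: 0.7},
     2: {0: 0.3, 1: 0.1, 2: 0.9}}

def best_act(utility, acts, belief):
    """Act maximizing expected utility under a belief (state -> probability)."""
    return max(acts, key=lambda a: sum(p * utility[s][a] for s, p in belief.items()))

def compare_delegation(trials=10_000, agent_accuracy=0.9):
    """Principal's average payoff from acting herself vs. from delegating."""
    val_self = val_delegate = 0.0
    for _ in range(trials):
        s = random.choice(STATES)
        # The principal is uninformed: uniform belief, limited reach.
        a_self = best_act(u, ACTS_PRINCIPAL, {t: 1 / 3 for t in STATES})
        # The agent observes the true state with probability agent_accuracy,
        # then maximizes its own (misaligned) utility v.
        obs = s if random.random() < agent_accuracy else random.choice(STATES)
        a_agent = best_act(v, ACTS_AGENT, {obs: 1.0})
        val_self += u[s][a_self]
        val_delegate += u[s][a_agent]
    return val_self / trials, val_delegate / trials
```

Under these toy numbers, delegation beats acting alone in the principal's own utility even though the agent optimizes a different function, because the agent's accuracy and extra act dominate the misalignment loss — the context-specific tradeoff the abstract describes.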
Related papers
- The Principle of Proportional Duty: A Knowledge-Duty Framework for Ethical Equilibrium in Human and Artificial Systems [0.0]
This paper introduces the Principle of Proportional Duty (PPD), a novel framework that models how ethical responsibility scales with an agent's epistemic state. As uncertainty increases, Action Duty (the duty to act decisively) is proportionally converted into Repair Duty (the active duty to verify, inquire, and resolve uncertainty). The paper applies the framework across four domains (clinical ethics, recipient-rights law, economic governance, and artificial intelligence) to demonstrate its cross-disciplinary validity.
arXiv Detail & Related papers (2025-12-07T02:37:07Z) - Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance [0.6267988254367711]
This paper presents a computational account of how legal norms can influence the behavior of artificial intelligence (AI) agents. We propose that lawful and norm-sensitive AI behavior can be achieved through regulation by design, where agents are endowed with intentional control systems. We conclude by discussing how context-dependent preferences could function as safety mechanisms for autonomous agents.
arXiv Detail & Related papers (2025-11-24T17:30:51Z) - Judicial Requirements for Generative AI in Legal Reasoning [0.0]
Large Language Models (LLMs) are being integrated into professional domains, yet their limitations in high-stakes fields like law remain poorly understood. This paper defines the core capabilities that an AI system must possess to function as a reliable reasoning tool in judicial decision-making.
arXiv Detail & Related papers (2025-08-26T09:56:26Z) - Resource Rational Contractualism Should Guide AI Alignment [69.07915246220985]
Contractualist alignment proposes grounding decisions in agreements that diverse stakeholders would endorse. We propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.
arXiv Detail & Related papers (2025-06-20T18:57:13Z) - AI Alignment at Your Discretion [7.133218044328296]
In AI alignment, latitude must be granted to annotators, either human or algorithmic, to judge which model outputs are 'better' or 'safer'. Such discretion remains largely unexamined, posing two risks: (i) annotators may use their power of discretion arbitrarily, and (ii) models may fail to mimic this discretion. By measuring both human and algorithmic discretion over safety alignment datasets, we reveal layers of discretion in the alignment process that were previously unaccounted for.
arXiv Detail & Related papers (2025-02-10T09:19:52Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
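The true-criticality notion summarized above (expected reward drop from n consecutive random actions) can be estimated by Monte Carlo rollouts. The chain MDP, horizon, and function names below are toy assumptions for illustration, not the paper's environments or estimator:

```python
import random

random.seed(1)

GOAL, HORIZON = 5, 20  # toy chain MDP: states 0..GOAL, reward 1 per step at the goal

def step(s, a):
    """Move along the chain by a (+1 or -1), clamped to [0, GOAL]."""
    s = max(0, min(GOAL, s + a))
    return s, (1.0 if s == GOAL else 0.0)

def rollout(s, n_random):
    """Total reward when the first n_random actions are random, then on-policy (+1)."""
    total = 0.0
    for t in range(HORIZON):
        a = random.choice([-1, 1]) if t < n_random else 1
        s, r = step(s, a)
        total += r
    return total

def true_criticality(s, n, trials=2000):
    """Expected reward drop from n consecutive random actions, per the definition above."""
    on_policy = sum(rollout(s, 0) for _ in range(trials)) / trials
    deviated = sum(rollout(s, n) for _ in range(trials)) / trials
    return on_policy - deviated
```

With n = 0 the deviation is vacuous and the criticality is zero; with n > 0 the random detour delays reaching the goal, so criticality is positive and bounded by the worst-case delay.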
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Online Decision Mediation [72.80902932543474]
Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior.
In clinical diagnosis, fully-autonomous machine behavior is often beyond ethical affordances.
arXiv Detail & Related papers (2023-10-28T05:59:43Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - Byzantine-Robust Online and Offline Distributed Reinforcement Learning [60.970950468309056]
We consider a distributed reinforcement learning setting where multiple agents explore the environment and communicate their experiences through a central server.
An α-fraction of the agents are adversarial and can report arbitrary fake information.
We seek to identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents.
arXiv Detail & Related papers (2022-06-01T00:44:53Z) - VCG Mechanism Design with Unknown Agent Values under Stochastic Bandit Feedback [104.06766271716774]
We study a multi-round welfare-maximising mechanism design problem in instances where agents do not know their values.
We first define three notions of regret: for the welfare, for the individual utilities of each agent, and for that of the mechanism.
Our framework also provides the flexibility to control the pricing scheme so as to trade off between the agent and seller regrets.
arXiv Detail & Related papers (2020-04-19T18:00:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.