What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment
- URL: http://arxiv.org/abs/2501.16448v1
- Date: Mon, 27 Jan 2025 19:13:39 GMT
- Title: What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment
- Authors: Robin Young
- Abstract summary: "Do no harm" faces a fundamental challenge in artificial intelligence.
How can we specify what constitutes harm?
We show that complete harm specification is impossible for any system where harm is defined external to its specifications.
- Score: 0.0
- Abstract: "First, do no harm" faces a fundamental challenge in artificial intelligence: how can we specify what constitutes harm? While prior work treats harm specification as a technical hurdle to be overcome through better algorithms or more data, we argue this assumption is unsound. Drawing on information theory, we demonstrate that complete harm specification is fundamentally impossible for any system where harm is defined external to its specifications. This impossibility arises from an inescapable information-theoretic gap: the entropy of harm H(O) always exceeds the mutual information I(O;I) between ground truth harm O and a system's specifications I. We introduce two novel metrics: semantic entropy H(S) and the safety-capability ratio I(O;I)/H(O), to quantify these limitations. Through a progression of increasingly sophisticated specification attempts, we show why each approach must fail and why the resulting gaps are not mere engineering challenges but fundamental constraints akin to the halting problem. These results suggest a paradigm shift: rather than pursuing complete specifications, AI alignment research should focus on developing systems that can operate safely despite irreducible specification uncertainty.
Related papers
- Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding [49.973156959947346]
Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos.
We introduce a robust network module that benefits from a two-stage cross-modal alignment task.
It integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training.
We further develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up.
arXiv Detail & Related papers (2024-08-29T05:32:03Z) - Can a Bayesian Oracle Prevent Harm from an Agent? [48.12936383352277]
We consider estimating a context-dependent bound on the probability of violating a given safety specification.
Noting that different plausible hypotheses about the world could produce very different outcomes, we derive a bound on the safety violation probability predicted under the true but unknown hypothesis.
We consider two forms of this result, in the iid case and in the non-iid case, and conclude with open problems towards turning such results into practical AI guardrails (a toy numerical sketch of the underlying idea appears at the end of this list).
arXiv Detail & Related papers (2024-08-09T18:10:42Z) - System Theoretic View on Uncertainties [0.0]
We propose a system theoretic approach to handle performance limitations.
We derive a taxonomy based on uncertainty, i.e. lack of knowledge, as a root cause.
arXiv Detail & Related papers (2023-03-07T16:51:24Z) - The #DNN-Verification Problem: Counting Unsafe Inputs for Deep Neural
Networks [94.63547069706459]
The #DNN-Verification problem involves counting the number of input configurations of a DNN that result in a violation of a safety property.
We propose a novel approach that returns the exact count of violations.
We present experimental results on a set of safety-critical benchmarks.
arXiv Detail & Related papers (2023-01-17T18:32:01Z) - Mitigating Covertly Unsafe Text within Natural Language Systems [55.26364166702625]
Uncontrolled systems may generate recommendations that lead to injury or life-threatening consequences.
In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text.
arXiv Detail & Related papers (2022-10-17T17:59:49Z) - Outlier Detection using AI: A Survey [0.0]
Outlier Detection (OD) is an ever-growing research field.
In this chapter, we discuss the progress of OD methods using AI techniques.
arXiv Detail & Related papers (2021-12-01T15:59:55Z) - Impossibility Results in AI: A Survey [3.198144010381572]
An impossibility theorem demonstrates that a particular problem or set of problems cannot be solved as described in the claim.
We have categorized impossibility theorems applicable to the domain of AI into five categories: deduction, indistinguishability, induction, tradeoffs, and intractability.
We conclude that deductive impossibilities rule out 100% guarantees for security.
arXiv Detail & Related papers (2021-09-01T16:52:13Z) - Counterfactual Explanations as Interventions in Latent Space [62.997667081978825]
Counterfactual explanations aim to provide end users with a set of features that need to be changed in order to achieve a desired outcome.
Current approaches rarely take into account the feasibility of actions needed to achieve the proposed explanations.
We present Counterfactual Explanations as Interventions in Latent Space (CEILS), a methodology to generate counterfactual explanations.
arXiv Detail & Related papers (2021-06-14T20:48:48Z) - Inspect, Understand, Overcome: A Survey of Practical Methods for AI
Safety [54.478842696269304]
The use of deep neural networks (DNNs) in safety-critical applications is challenging due to numerous model-inherent shortcomings.
In recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged.
Our paper addresses both machine learning experts and safety engineers.
arXiv Detail & Related papers (2021-04-29T09:54:54Z) - Towards Probability-based Safety Verification of Systems with Components
from Machine Learning [8.75682288556859]
Safety verification of machine learning systems is currently thought to be infeasible or, at least, very hard.
We think that it requires taking into account specific properties of ML technology such as: (i) Most ML approaches are inductive, which is both their power and their source of error.
We propose verification based on probabilities of errors, both estimated from controlled experiments and output by the inductively learned model itself.
arXiv Detail & Related papers (2020-03-02T19:31:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.