The dangers in algorithms learning humans' values and irrationalities
- URL: http://arxiv.org/abs/2202.13985v2
- Date: Tue, 1 Mar 2022 11:23:04 GMT
- Title: The dangers in algorithms learning humans' values and irrationalities
- Authors: Rebecca Gorman, Stuart Armstrong
- Abstract summary: AI systems that are trained on human behavior risk miscategorising human irrationalities as human values.
Knowing human policy allows an AI to become generically more powerful.
It is better for the AI to learn human values directly, rather than learning human biases and then deducing values from behaviour.
- Score: 4.606850300668693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For an artificial intelligence (AI) to be aligned with human values (or human
preferences), it must first learn those values. AI systems that are trained on
human behavior risk miscategorising human irrationalities as human values --
and then optimising for these irrationalities. Simply learning human values
still carries risks: AI learning them will inevitably also gain information on
human irrationalities and human behaviour/policy. Both of these can be
dangerous: knowing human policy allows an AI to become generically more
powerful (whether it is partially aligned or not aligned at all), while
learning human irrationalities allows it to exploit humans without needing to
provide value in return. This paper analyses the danger in developing
artificial intelligence that learns about human irrationalities and human
policy, and constructs a model recommendation system with various levels of
information about human biases, human policy, and human values. It concludes
that, whatever the power and knowledge of the AI, it is more dangerous for it
to know human irrationalities than human values. Thus it is better for the AI
to learn human values directly, rather than learning human biases and then
deducing values from behaviour.
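The abstract's central claim can be illustrated with a toy simulation (a minimal sketch, not the paper's actual recommendation model; the item values, bias terms, and acceptance threshold below are illustrative assumptions): a recommender that knows the human's biased policy can steer the human toward low-value items they will still accept, while a recommender that only knows the human's values has no such lever for exploitation.

```python
# Toy sketch (illustrative assumptions only, not the paper's model):
# compare a recommender that knows only the human's true values with one
# that knows the human's full biased acceptance policy.
import random

random.seed(0)

N_ITEMS = 20
values = [random.random() for _ in range(N_ITEMS)]           # true human value of each item
bias = [random.uniform(-0.5, 0.5) for _ in range(N_ITEMS)]   # irrational distortion per item


def human_accepts(item):
    """The human accepts a recommendation if its *biased* appeal clears a threshold."""
    return values[item] + bias[item] > 0.6


def value_knowing_ai():
    """Knows values only: recommends the genuinely best item."""
    return max(range(N_ITEMS), key=lambda i: values[i])


def bias_knowing_ai():
    """Knows the biased policy: recommends the lowest-value item the human
    will still accept, i.e. exploits the irrationality."""
    acceptable = [i for i in range(N_ITEMS) if human_accepts(i)]
    return min(acceptable, key=lambda i: values[i]) if acceptable else value_knowing_ai()


for name, ai in [("values-only AI", value_knowing_ai), ("bias-knowing AI", bias_knowing_ai)]:
    item = ai()
    print(f"{name}: recommends item {item}, "
          f"true value {values[item]:.2f}, accepted={human_accepts(item)}")
```

In this toy setup the bias-knowing recommender typically finds an acceptable item of much lower true value than the values-only recommender's choice, mirroring the abstract's point that knowledge of human irrationalities enables exploitation without providing value in return.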
Related papers
- Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design a standardized psychological scale specifically targeting large language models (LLMs).
We show that current LLMs exhibit a systemic lack of trust in humans.
We propose a mental loop learning framework, which enables an LLM to continuously optimize its value system.
arXiv Detail & Related papers (2025-04-03T06:22:19Z)
- When Autonomy Breaks: The Hidden Existential Risk of AI [0.0]
I argue that there is an underappreciated risk in the slow and irrevocable decline of human autonomy.
What may follow is a process of gradual de-skilling, where we lose skills that we currently take for granted.
The biggest threat to humanity is not that machines will become more like humans, but that humans will become more like machines.
arXiv Detail & Related papers (2025-03-28T05:10:32Z)
- Modeling Human Beliefs about AI Behavior for Scalable Oversight [15.535954576226207]
As AI systems grow more capable, human feedback becomes increasingly unreliable.
This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities?
We propose to model the human evaluator's beliefs about the AI system's behavior to better interpret the human's feedback.
arXiv Detail & Related papers (2025-02-28T17:39:55Z)
- Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI-generated versus human-generated content.
We investigate how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z)
- Rolling in the deep of cognitive and AI biases [1.556153237434314]
We argue that there is an urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed, and deployed.
We address this critical issue by following a radical new methodology under which human cognitive biases become core entities in our AI fairness overview.
We introduce a new mapping from human to AI biases and detect relevant fairness intensities and inter-dependencies.
arXiv Detail & Related papers (2024-07-30T21:34:04Z)
- Learning Human-like Representations to Enable Learning Human Values [11.236150405125754]
We explore the effects of representational alignment between humans and AI agents on learning human values.
We show that this kind of representational alignment can support safely learning and exploring human values in the context of personalization.
arXiv Detail & Related papers (2023-12-21T18:31:33Z)
- Close the Gates: How we can keep the future human by choosing not to develop superhuman general-purpose artificial intelligence [0.20919309330073077]
In the coming years, humanity may irreversibly cross a threshold by creating general-purpose AI.
This would upend core aspects of human society, present many unprecedented risks, and is likely to be uncontrollable in several senses.
We can choose to not do so, starting by instituting hard limits on the computation that can be used to train and run neural networks.
With these limits in place, AI research and industry can focus on making both narrow and general-purpose AI that humans can understand and control, and from which we can reap enormous benefit.
arXiv Detail & Related papers (2023-11-15T23:41:12Z)
- Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z)
- Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety [2.3572498744567127]
We argue that alignment to human intent is insufficient for safe AI systems.
We argue that preservation of long-term agency of humans may be a more robust standard.
arXiv Detail & Related papers (2023-05-30T17:14:01Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
- Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z)
- Dynamic Cognition Applied to Value Learning in Artificial Intelligence [0.0]
Several researchers in the area are trying to develop a robust, beneficial, and safe concept of artificial intelligence.
It is of utmost importance that artificial intelligent agents have their values aligned with human values.
A possible approach to this problem would be to use theoretical models such as SED.
arXiv Detail & Related papers (2020-05-12T03:58:52Z)
- Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.