Related papers: Circumventing interpretability: How to defeat mind-readers

Circumventing interpretability: How to defeat mind-readers

URL: http://arxiv.org/abs/2212.11415v1
Date: Wed, 21 Dec 2022 23:52:42 GMT
Title: Circumventing interpretability: How to defeat mind-readers
Authors: Lee Sharkey
Abstract summary: misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

Related papers

Propositional Interpretability in Artificial Intelligence [0.0]
I argue for the importance of propositional interpretability, which involves interpreting a system's mechanisms and behavior in terms of propositional attitudes. A central challenge is what I call thought logging: creating systems that log all of the relevant propositional attitudes in an AI system over time.
arXiv Detail & Related papers (2025-01-27T03:06:06Z)
Imagining and building wise machines: The centrality of AI metacognition [78.76893632793497]
We argue that shortcomings stem from one overarching failure: AI systems lack wisdom. While AI research has focused on task-level strategies, metacognition is underdeveloped in AI systems. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety.
arXiv Detail & Related papers (2024-11-04T18:10:10Z)
Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks. We show that the learned AI control system demonstrates robustness against adversarial tampering. In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z)
Artificial Intelligence: Arguments for Catastrophic Risk [0.0]
We review two influential arguments purporting to show how AI could pose catastrophic risks. The first argument -- the Problem of Power-Seeking -- claims that advanced AI systems are likely to engage in dangerous power-seeking behavior. The second argument claims that the development of human-level AI will unlock rapid further progress.
arXiv Detail & Related papers (2024-01-27T19:34:13Z)
A Review on Objective-Driven Artificial Intelligence [0.0]
Humans have an innate ability to understand context, nuances, and subtle cues in communication. Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world. Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial.
arXiv Detail & Related papers (2023-08-20T02:07:42Z)
Seamful XAI: Operationalizing Seamful Design in Explainable AI [59.89011292395202]
Mistakes in AI systems are inevitable, arising from both technical limitations and sociotechnical gaps. We propose that seamful design can foster AI explainability by revealing sociotechnical and infrastructural mismatches. We explore this process with 43 AI practitioners and real end-users.
arXiv Detail & Related papers (2022-11-12T21:54:05Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being. For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z)
Building Bridges: Generative Artworks to Explore AI Ethics [56.058588908294446]
In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society. A significant challenge in the design of ethical AI systems is that there are multiple stakeholders in the AI pipeline, each with their own set of constraints and interests. This position paper outlines some potential ways in which generative artworks can play this role by serving as accessible and powerful educational tools.
arXiv Detail & Related papers (2021-06-25T22:31:55Z)
Dynamic Cognition Applied to Value Learning in Artificial Intelligence [0.0]
Several researchers in the area are trying to develop a robust, beneficial, and safe concept of artificial intelligence. It is of utmost importance that artificial intelligent agents have their values aligned with human values. A possible approach to this problem would be to use theoretical models such as SED.
arXiv Detail & Related papers (2020-05-12T03:58:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.