Circumventing interpretability: How to defeat mind-readers
- URL: http://arxiv.org/abs/2212.11415v1
- Date: Wed, 21 Dec 2022 23:52:42 GMT
- Title: Circumventing interpretability: How to defeat mind-readers
- Authors: Lee Sharkey
- Abstract summary: misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret.
I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing capabilities of artificial intelligence (AI) systems make it
ever more important that we interpret their internals to ensure that their
intentions are aligned with human values. Yet there is reason to believe that
misaligned artificial intelligence will have a convergent instrumental
incentive to make its thoughts difficult for us to interpret. In this article,
I discuss many ways that a capable AI might circumvent scalable
interpretability methods and suggest a framework for thinking about these
potential future risks.
Related papers
- Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks.
We show that the learned AI control system demonstrates robustness against adversarial tampering.
In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z) - Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience [4.524832437237367]
Inner Interpretability is a promising field tasked with uncovering the inner mechanisms of AI systems.
Recent critiques raise issues that question its usefulness to advance the broader goals of AI.
Here we draw the relevant connections and highlight lessons that can be transferred productively between fields.
arXiv Detail & Related papers (2024-06-03T14:16:56Z) - Artificial Intelligence: Arguments for Catastrophic Risk [0.0]
We review two influential arguments purporting to show how AI could pose catastrophic risks.
The first argument -- the Problem of Power-Seeking -- claims that advanced AI systems are likely to engage in dangerous power-seeking behavior.
The second argument claims that the development of human-level AI will unlock rapid further progress.
arXiv Detail & Related papers (2024-01-27T19:34:13Z) - A Review on Objective-Driven Artificial Intelligence [0.0]
Humans have an innate ability to understand context, nuances, and subtle cues in communication.
Humans possess a vast repository of common-sense knowledge that helps us make logical inferences and predictions about the world.
Machines lack this innate understanding and often struggle with making sense of situations that humans find trivial.
arXiv Detail & Related papers (2023-08-20T02:07:42Z) - Seamful XAI: Operationalizing Seamful Design in Explainable AI [59.89011292395202]
Mistakes in AI systems are inevitable, arising from both technical limitations and sociotechnical gaps.
We propose that seamful design can foster AI explainability by revealing sociotechnical and infrastructural mismatches.
We explore this process with 43 AI practitioners and real end-users.
arXiv Detail & Related papers (2022-11-12T21:54:05Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - The Who in XAI: How AI Background Shapes Perceptions of AI Explanations [61.49776160925216]
We conduct a mixed-methods study of how two different groups--people with and without AI background--perceive different types of AI explanations.
We find that (1) both groups showed unwarranted faith in numbers for different reasons and (2) each group found value in different explanations beyond their intended design.
arXiv Detail & Related papers (2021-07-28T17:32:04Z) - Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z) - Building Bridges: Generative Artworks to Explore AI Ethics [56.058588908294446]
In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society.
A significant challenge in the design of ethical AI systems is that there are multiple stakeholders in the AI pipeline, each with their own set of constraints and interests.
This position paper outlines some potential ways in which generative artworks can play this role by serving as accessible and powerful educational tools.
arXiv Detail & Related papers (2021-06-25T22:31:55Z) - Dynamic Cognition Applied to Value Learning in Artificial Intelligence [0.0]
Several researchers in the area are trying to develop a robust, beneficial, and safe concept of artificial intelligence.
It is of utmost importance that artificial intelligent agents have their values aligned with human values.
A possible approach to this problem would be to use theoretical models such as SED.
arXiv Detail & Related papers (2020-05-12T03:58:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.