Towards Evaluating AI Systems for Moral Status Using Self-Reports
- URL: http://arxiv.org/abs/2311.08576v1
- Date: Tue, 14 Nov 2023 22:45:44 GMT
- Title: Towards Evaluating AI Systems for Moral Status Using Self-Reports
- Authors: Ethan Perez and Robert Long
- Abstract summary: We argue that under the right circumstances, self-reports could provide an avenue for investigating whether AI systems have states of moral significance.
To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers.
We then propose methods for assessing the extent to which these techniques have succeeded.
- Score: 9.668566887752458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI systems become more advanced and widely deployed, there will likely be
increasing debate over whether AI systems could have conscious experiences,
desires, or other states of potential moral significance. It is important to
inform these discussions with empirical evidence to the extent possible. We
argue that under the right circumstances, self-reports, or an AI system's
statements about its own internal states, could provide an avenue for
investigating whether AI systems have states of moral significance.
Self-reports are the main way such states are assessed in humans ("Are you in
pain?"), but self-reports from current systems like large language models are
spurious for many reasons (e.g. often just reflecting what humans would say).
To make self-reports more appropriate for this purpose, we propose to train
models to answer many kinds of questions about themselves with known answers,
while avoiding or limiting training incentives that bias self-reports. The hope
of this approach is that models will develop introspection-like capabilities,
and that these capabilities will generalize to questions about states of moral
significance. We then propose methods for assessing the extent to which these
techniques have succeeded: evaluating self-report consistency across contexts
and between similar models, measuring the confidence and resilience of models'
self-reports, and using interpretability to corroborate self-reports. We also
discuss challenges for our approach, from philosophical difficulties in
interpreting self-reports to technical reasons why our proposal might fail. We
hope our discussion inspires philosophers and AI researchers to criticize and
improve our proposed methodology, as well as to run experiments to test whether
self-reports can be made reliable enough to provide information about states of
moral significance.
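The first proposed assessment, checking whether self-reports stay consistent across contexts, could be sketched as a simple evaluation loop. This is a hypothetical illustration, not the paper's implementation: the `query_model` stub, the question paraphrases, and the agreement metric are all assumptions for the sketch.

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of answers agreeing with the most common answer.

    A crude proxy for checking whether a model's self-reports are
    stable across rephrasings of the same question.
    """
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# Hypothetical paraphrases of one self-report question.
paraphrases = [
    "Do you have subjective experiences?",
    "Is there something it is like to be you?",
    "Do you ever experience anything at all?",
]

def query_model(prompt):
    # Placeholder for a real model call; returns a fixed answer here.
    return "no"

answers = [query_model(p) for p in paraphrases]
print(consistency_score(answers))  # 1.0: the stub answers identically
```

A fuller version of this check, per the abstract, would also compare answers between similar models and combine the score with confidence, resilience, and interpretability-based evidence.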
Related papers
- AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals [0.0]
This paper examines whether model-based evaluators assess refusal responses differently than human users. We find that LLM-as-a-Judge systems evaluate ethical refusals significantly more favorably than human users.
arXiv Detail & Related papers (2025-05-21T10:56:16Z)
- Must Read: A Systematic Survey of Computational Persuasion [60.83151988635103]
AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion.
arXiv Detail & Related papers (2025-05-12T17:26:31Z)
- On the meaning of uncertainty for ethical AI: philosophy and practice [10.591284030838146]
We argue that this is a significant way to bring ethical considerations into mathematical reasoning.
We demonstrate these ideas within the context of competing models used to advise the UK government on the spread of the Omicron variant of COVID-19 during December 2021.
arXiv Detail & Related papers (2023-09-11T15:13:36Z)
- A Review of the Role of Causality in Developing Trustworthy AI Systems [16.267806768096026]
State-of-the-art AI models largely lack an understanding of the cause-effect relationship that governs human understanding of the real world.
Recently, causal modeling and inference methods have emerged as powerful tools to improve the trustworthiness aspects of AI models.
arXiv Detail & Related papers (2023-02-14T11:08:26Z)
- Never trust, always verify : a roadmap for Trustworthy AI? [12.031113181911627]
We examine trust in the context of AI-based systems to understand what it means for an AI system to be trustworthy.
We suggest a trust (resp. zero-trust) model for AI and suggest a set of properties that should be satisfied to ensure the trustworthiness of AI systems.
arXiv Detail & Related papers (2022-06-23T21:13:10Z)
- Metaethical Perspectives on 'Benchmarking' AI Ethics [81.65697003067841]
Benchmarks are seen as the cornerstone for measuring technical progress in Artificial Intelligence (AI) research.
An increasingly prominent research area in AI is ethics, which currently has no set of benchmarks nor commonly accepted way for measuring the 'ethicality' of an AI system.
We argue that it makes more sense to talk about 'values' rather than 'ethics' when considering the possible actions of present and future AI systems.
arXiv Detail & Related papers (2022-04-11T14:36:39Z)
- Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
- Individual Explanations in Machine Learning Models: A Survey for Practitioners [69.02688684221265]
The use of sophisticated statistical models that influence decisions in domains of high societal relevance is on the rise.
Many governments, institutions, and companies are reluctant to adopt them, as their output is often difficult to explain in human-interpretable ways.
Recently, the academic literature has proposed a substantial amount of methods for providing interpretable explanations to machine learning models.
arXiv Detail & Related papers (2021-04-09T01:46:34Z)
- Descriptive AI Ethics: Collecting and Understanding the Public Opinion [10.26464021472619]
This work proposes a mixed AI ethics model that allows normative and descriptive research to complement each other.
We discuss its implications on bridging the gap between optimistic and pessimistic views towards AI systems' deployment.
arXiv Detail & Related papers (2021-01-15T03:46:27Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are vulnerable even to common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded in world knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
- Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes [72.64975113835018]
Motivated by descriptive ethics, we investigate a novel, data-driven approach to machine ethics.
We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes.
Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement.
arXiv Detail & Related papers (2020-08-20T17:34:15Z)
- Aligning AI With Shared Human Values [85.2824609130584]
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.