Minimum Levels of Interpretability for Artificial Moral Agents
- URL: http://arxiv.org/abs/2307.00660v1
- Date: Sun, 2 Jul 2023 20:27:55 GMT
- Title: Minimum Levels of Interpretability for Artificial Moral Agents
- Authors: Avish Vijayaraghavan, Cosmin Badea
- Abstract summary: For models involved in moral decision-making, also known as artificial moral agents (AMA), interpretability provides a way to trust and understand the agent's internal reasoning mechanisms for effective use and error correction.
We introduce the concept of the Minimum Level of Interpretability (MLI) and recommend an MLI for various types of agents, to aid their safe deployment in real-world settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As artificial intelligence (AI) models continue to scale up, they are
becoming more capable and integrated into various forms of decision-making
systems. For models involved in moral decision-making, also known as artificial
moral agents (AMA), interpretability provides a way to trust and understand the
agent's internal reasoning mechanisms for effective use and error correction.
In this paper, we provide an overview of this rapidly-evolving sub-field of AI
interpretability, introduce the concept of the Minimum Level of
Interpretability (MLI) and recommend an MLI for various types of agents, to aid
their safe deployment in real-world settings.
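The MLI is framed as a threshold an agent must clear before deployment. As a rough, hedged sketch (not drawn from the paper), the idea can be illustrated as a deployment gate that compares an agent's achieved interpretability against the minimum required for its role; the level names, agent types, and the mapping below are illustrative assumptions, not the paper's actual recommendations.

```python
# Hypothetical sketch of the MLI idea as a deployment gate.
# The level names and the agent-to-MLI mapping are illustrative
# assumptions, not taken from the paper.
from enum import IntEnum


class InterpretabilityLevel(IntEnum):
    """Ordered interpretability levels; higher means more transparent."""
    BLACK_BOX = 0   # no insight into internal reasoning
    POST_HOC = 1    # explanations generated after the fact
    INHERENT = 2    # reasoning is directly readable (e.g. explicit rules)


# Assumed mapping from agent type to its required minimum level.
REQUIRED_MLI = {
    "recommender": InterpretabilityLevel.POST_HOC,
    "clinical_decision_support": InterpretabilityLevel.INHERENT,
}


def meets_mli(agent_type: str, achieved: InterpretabilityLevel) -> bool:
    """Return True if the agent's achieved interpretability meets or
    exceeds the minimum level required for its deployment context."""
    return achieved >= REQUIRED_MLI[agent_type]


if __name__ == "__main__":
    print(meets_mli("recommender", InterpretabilityLevel.INHERENT))                 # True
    print(meets_mli("clinical_decision_support", InterpretabilityLevel.POST_HOC))   # False
```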
Related papers
- Evolution of AI in Education: Agentic Workflows [2.1681971652284857]
Artificial intelligence (AI) has transformed various aspects of education.
Large language models (LLMs) are driving advancements in automated tutoring, assessment, and content generation.
To address these limitations and foster more sustainable technological practices, AI agents have emerged as a promising new avenue for educational innovation.
arXiv Detail & Related papers (2025-04-25T13:44:57Z) - Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents [61.132523071109354]
This paper investigates the interplay between AI developers, regulators and users, modelling their strategic choices under different regulatory scenarios.
Our research identifies emerging behaviours of strategic AI agents, which tend to adopt more "pessimistic" stances than pure game-theoretic agents.
arXiv Detail & Related papers (2025-04-11T15:41:21Z) - AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability [84.52205243353761]
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment.
We investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation.
arXiv Detail & Related papers (2025-04-06T20:35:44Z) - Explainable artificial intelligence (XAI): from inherent explainability to large language models [0.0]
Explainable AI (XAI) techniques facilitate the explainability or interpretability of machine learning models.
This paper details the advancements of explainable AI methods, from inherently interpretable models to modern approaches.
We review explainable AI techniques that leverage vision-language model (VLM) frameworks to automate or improve the explainability of other machine learning models.
arXiv Detail & Related papers (2025-01-17T06:16:57Z) - Causal Responsibility Attribution for Human-AI Collaboration [62.474732677086855]
This paper presents a causal framework using Structural Causal Models (SCMs) to systematically attribute responsibility in human-AI systems.
Two case studies illustrate the framework's adaptability in diverse human-AI collaboration scenarios.
arXiv Detail & Related papers (2024-11-05T17:17:45Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI [2.99664686845172]
We introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection.
We showcase the system's adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience.
Our research underscores MIRA's ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.
arXiv Detail & Related papers (2024-09-30T16:40:27Z) - Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization [0.6629765271909505]
This paper introduces a novel approach to model alignment through weak-to-strong generalization in the context of language models.
Our results suggest that this facilitation-based approach not only enhances model performance but also provides insights into the nature of model alignment.
arXiv Detail & Related papers (2024-09-11T15:16:25Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - Position Paper: Agent AI Towards a Holistic Intelligence [53.35971598180146]
We emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions.
In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model.
arXiv Detail & Related papers (2024-02-28T16:09:56Z) - Measuring Value Alignment [12.696227679697493]
This paper introduces a novel formalism to quantify the alignment between AI systems and human values.
By utilizing this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values.
arXiv Detail & Related papers (2023-12-23T12:30:06Z) - Hybrid Approaches for Moral Value Alignment in AI Agents: a Manifesto [3.7414804164475983]
Increasing interest in ensuring the safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents.
We provide a systematization of existing approaches to the problem of introducing morality in machines - modelled as a continuum.
We argue that more hybrid solutions are needed to create adaptable and robust, yet controllable and interpretable agentic systems.
arXiv Detail & Related papers (2023-12-04T11:46:34Z) - Evaluating Explainability in Machine Learning Predictions through Explainer-Agnostic Metrics [0.0]
We develop six distinct model-agnostic metrics designed to quantify the extent to which model predictions can be explained.
These metrics measure different aspects of model explainability, ranging from local importance, global importance, and surrogate predictions.
We demonstrate the practical utility of these metrics on classification and regression tasks, and integrate these metrics into an existing Python package for public use.
arXiv Detail & Related papers (2023-02-23T15:28:36Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z) - An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper proposes a comprehensive analysis of existing concepts coming from different disciplines tackling the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider for qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z) - Modelling Agent Policies with Interpretable Imitation Learning [12.858982225307809]
We outline an approach to imitation learning for reverse-engineering black box agent policies in MDP environments.
We explicitly model and learn agents' latent state representations by selecting from a large space of candidate features constructed from the Markov state.
arXiv Detail & Related papers (2020-06-19T18:19:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.