Model-Agnostic Policy Explanations with Large Language Models
- URL: http://arxiv.org/abs/2504.05625v1
- Date: Tue, 08 Apr 2025 02:56:02 GMT
- Title: Model-Agnostic Policy Explanations with Large Language Models
- Authors: Zhang Xi-Jia, Yue Guo, Shufei Chen, Simon Stepputtis, Matthew Gombolay, Katia Sycara, Joseph Campbell
- Abstract summary: We propose a method for generating natural language explanations of agent behavior based only on observed states and actions. Our approach learns a locally interpretable surrogate model of the agent's behavior from observations. We find that participants in a user study more accurately predicted the agent's future actions when given our explanations.
- Score: 6.405870799906393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intelligent agents, such as robots, are increasingly deployed in real-world, human-centric environments. To foster appropriate human trust and meet legal and ethical standards, these agents must be able to explain their behavior. However, state-of-the-art agents are typically driven by black-box models like deep neural networks, limiting their interpretability. We propose a method for generating natural language explanations of agent behavior based only on observed states and actions -- without access to the agent's underlying model. Our approach learns a locally interpretable surrogate model of the agent's behavior from observations, which then guides a large language model to generate plausible explanations with minimal hallucination. Empirical results show that our method produces explanations that are more comprehensible and correct than those from baselines, as judged by both language models and human evaluators. Furthermore, we find that participants in a user study more accurately predicted the agent's future actions when given our explanations, suggesting improved understanding of agent behavior.
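A minimal sketch of the idea in the abstract, under assumptions the paper does not fix: the surrogate here is a shallow decision tree (the paper only specifies a locally interpretable model), and the feature names, action names, toy data, and prompt wording are illustrative rather than the authors' code.

```python
# Sketch: explain an agent's action at a query state via an interpretable surrogate.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Observed (state, action) pairs collected by watching the black-box agent.
rng = np.random.default_rng(0)
states = rng.uniform(size=(500, 3))          # e.g. [distance, battery, obstacle]
actions = (states[:, 0] > 0.5).astype(int)   # toy behavior: move only when far away

feature_names = ["distance_to_goal", "battery_level", "obstacle_proximity"]
action_names = {0: "wait", 1: "move_forward"}

# Locally interpretable surrogate of the agent's behavior
# (fit globally here for brevity; a local fit would weight samples near the query).
surrogate = DecisionTreeClassifier(max_depth=2).fit(states, actions)

query_state = np.array([[0.9, 0.7, 0.1]])
predicted = action_names[int(surrogate.predict(query_state)[0])]
rules = export_text(surrogate, feature_names=feature_names)

# The surrogate's rules ground the LLM prompt, constraining the explanation to
# observed behavior and leaving less room for hallucinated rationales.
prompt = (
    "An agent was observed acting in an environment.\n"
    f"At state {dict(zip(feature_names, query_state[0]))} it chose '{predicted}'.\n"
    f"A surrogate model of its behavior follows these rules:\n{rules}\n"
    "Explain in plain language why the agent likely chose this action."
)
print(prompt)  # this prompt would then be sent to an LLM of choice
```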
Related papers
- Towards Explainable Goal Recognition Using Weight of Evidence (WoE): A Human-Centered Approach [5.174712539403376]
Goal recognition (GR) involves inferring an agent's unobserved goal from a sequence of observations.
Traditionally, GR has been addressed using 'inference to the best explanation' or abduction.
We introduce and evaluate an explainable model for GR agents, grounded in the theoretical framework and cognitive processes underlying human behavior explanation.
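As a rough illustration of the weight-of-evidence quantity this line of work builds on (a log-likelihood ratio of an observation under a goal hypothesis versus its alternatives), here is a minimal sketch; the goals and probabilities are invented, and this is not the paper's actual model.

```python
# Weight of evidence (WoE): how strongly an observation favors one goal over the rest.
import math

# P(observation | goal) for each candidate goal; values are made up for illustration.
likelihoods = {
    "make_coffee": {"pick_up_mug": 0.8, "open_fridge": 0.2},
    "make_tea":    {"pick_up_mug": 0.7, "open_fridge": 0.1},
    "make_salad":  {"pick_up_mug": 0.1, "open_fridge": 0.9},
}

def weight_of_evidence(obs: str, goal: str) -> float:
    """log P(obs | goal) - log P(obs | other goals), uniform over the alternatives."""
    p_given_goal = likelihoods[goal][obs]
    others = [likelihoods[g][obs] for g in likelihoods if g != goal]
    return math.log(p_given_goal) - math.log(sum(others) / len(others))

for g in likelihoods:
    print(g, round(weight_of_evidence("pick_up_mug", g), 2))
# Positive WoE: the observation is evidence for that goal; negative: evidence against it.
```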
arXiv Detail & Related papers (2024-09-18T03:30:01Z)
- Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
- Understanding Your Agent: Leveraging Large Language Models for Behavior Explanation [7.647395374489533]
We propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions.
We show that our approach generates explanations as helpful as those produced by a human domain expert.
arXiv Detail & Related papers (2023-11-29T20:16:23Z)
- Explaining Agent Behavior with Large Language Models [7.128139268426959]
We propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions.
We show how a compact representation of the agent's behavior can be learned and used to produce plausible explanations.
arXiv Detail & Related papers (2023-09-19T06:13:24Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Behavioral Analysis of Vision-and-Language Navigation Agents [21.31684388423088]
Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on surroundings.
We develop a methodology to study agent behavior on a skill-specific basis.
arXiv Detail & Related papers (2023-07-20T11:42:24Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction.
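A minimal sketch of what a commonsense mask-infilling objective might look like, assuming a generic Hugging Face masked LM; the checkpoint choice and the verbalized knowledge triple are illustrative, not the paper's setup.

```python
# Mask a concept in a knowledge-derived sentence and score the LM on recovering it.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# A commonsense statement verbalized from a (head, relation, tail) triple.
masked = "A person uses an [MASK] to stay dry in the rain."
target = "umbrella"

inputs = tok(masked, return_tensors="pt")
labels = inputs["input_ids"].clone()
labels[labels != tok.mask_token_id] = -100                      # score only the masked slot
labels[labels == tok.mask_token_id] = tok.convert_tokens_to_ids(target)

loss = model(**inputs, labels=labels).loss                      # infilling loss to minimize
print(float(loss))
```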
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- Probing via Prompting [71.7904179689271]
This paper introduces a novel model-free approach to probing, by formulating probing as a prompting task.
We conduct experiments on five probing tasks and show that our approach is comparable or better at extracting information than diagnostic probes.
We then examine the usefulness of a specific linguistic property for pre-training by removing the heads that are essential to that property and evaluating the resulting model's performance on language modeling.
arXiv Detail & Related papers (2022-07-04T22:14:40Z)
- Learning Theory of Mind via Dynamic Traits Attribution [59.9781556714202]
We propose a new neural ToM architecture that learns to generate a latent trait vector of an actor from the past trajectories.
This trait vector then multiplicatively modulates the prediction mechanism via a fast-weights scheme in the prediction neural network.
We empirically show that the fast weights provide a good inductive bias for modeling the character traits of agents and hence improve mindreading ability.
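A minimal sketch of trait-conditioned multiplicative modulation under assumed shapes; the trait encoder and dimensions below are placeholders, not the paper's architecture.

```python
# A latent trait vector, inferred from an actor's past trajectories, rescales
# the prediction network's weights (a simple "fast weights" modulation).
import numpy as np

rng = np.random.default_rng(0)
obs_dim, trait_dim, hidden = 8, 4, 16

past_trajectories = rng.normal(size=(5, 20, obs_dim))   # 5 episodes of 20 steps each
trait = np.tanh(past_trajectories.mean(axis=(0, 1)) @ rng.normal(size=(obs_dim, trait_dim)))

W_slow = rng.normal(size=(obs_dim, hidden))             # ordinary ("slow") weights
M = rng.normal(size=(trait_dim, hidden))                # maps trait -> per-unit gains

def predict_hidden(obs: np.ndarray, trait: np.ndarray) -> np.ndarray:
    gains = 1.0 + np.tanh(trait @ M)       # fast weights: trait-dependent multipliers
    return np.tanh(obs @ (W_slow * gains)) # modulated hidden layer; a real model would
                                           # map this to an action distribution

print(predict_hidden(rng.normal(size=obs_dim), trait).shape)
```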
arXiv Detail & Related papers (2022-04-17T11:21:18Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Agent Modelling under Partial Observability for Deep Reinforcement Learning [12.903487594031276]
Existing methods for agent modelling assume knowledge of the local observations and chosen actions of the modelled agents during execution.
We learn to extract representations about the modelled agents conditioned only on the local observations of the controlled agent.
The representations are used to augment the controlled agent's decision policy which is trained via deep reinforcement learning.
arXiv Detail & Related papers (2020-06-16T18:43:42Z)
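For the last entry, a minimal sketch of the idea: an encoder maps the controlled agent's own local observation to a representation of the other agents, which is concatenated with the observation before the policy head. Sizes and names are assumptions, and the reinforcement-learning training loop is omitted.

```python
# Agent modelling from local observations only, feeding the policy an extra embedding.
import numpy as np

rng = np.random.default_rng(1)
obs_dim, embed_dim, n_actions = 10, 6, 4

W_enc = rng.normal(size=(obs_dim, embed_dim))              # agent-modelling encoder
W_pi = rng.normal(size=(obs_dim + embed_dim, n_actions))   # policy head

def act(local_obs: np.ndarray) -> int:
    z = np.tanh(local_obs @ W_enc)                 # inferred representation of other agents
    logits = np.concatenate([local_obs, z]) @ W_pi # policy sees observation + embedding
    return int(np.argmax(logits))                  # greedy action; RL training omitted

print(act(rng.normal(size=obs_dim)))
```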
This list is automatically generated from the titles and abstracts of the papers on this site.