A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values
- URL: http://arxiv.org/abs/2505.07797v1
- Date: Mon, 12 May 2025 17:48:28 GMT
- Title: A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values
- Authors: Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek
- Abstract summary: We develop a theoretical framework for explaining reinforcement learning through the influence of state features. We identify three core elements of the agent-environment interaction that benefit from explanation. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret. This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential. In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment. We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve). We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees. We use illustrative examples to show how these explanations align with human intuition and reveal novel insights. Our framework unifies and extends prior work, making explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.
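To make the central construction concrete: treating the feature set F as players, the influence of a feature i on an explained quantity v (behaviour, performance, or a value estimate) is its Shapley value, phi_i(v) = sum over S ⊆ F\{i} of |S|! (|F|-|S|-1)! / |F|! * [v(S ∪ {i}) - v(S)]. The sketch below enumerates this sum exactly for a toy case. It is a minimal illustration, not the paper's implementation: the characteristic function `value_fn` and the zero-baseline convention for absent features are assumptions made here for the example, and exact enumeration is exponential in |F|.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values for a small set of state features.

    value_fn(coalition) must return the explained quantity (e.g. the
    agent's value estimate) when only the features in `coalition` are
    present; how "absent" features are treated (a baseline state,
    marginalisation over a distribution, ...) is a modelling choice.
    """
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        rest = [g for g in features if g != f]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(rest, size):
                gain = value_fn(frozenset(coalition) | {f}) \
                     - value_fn(frozenset(coalition))
                phi[f] += w * gain
    return phi

# Toy usage: a value estimate V(s) = x + 2y over features {x, y, z},
# with absent features replaced by a zero baseline (an assumption).
state, baseline = {"x": 1.0, "y": 2.0, "z": 3.0}, 0.0

def value_fn(coalition):
    s = {f: (state[f] if f in coalition else baseline) for f in state}
    return s["x"] + 2.0 * s["y"]

print(shapley_values(list(state), value_fn))
# {'x': 1.0, 'y': 4.0, 'z': 0.0}: the influences sum to
# V(state) - V(baseline) = 5.0, the Shapley efficiency property,
# and the irrelevant feature 'z' receives zero influence.
```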
Related papers
- Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study [50.065744358362345]
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning. Yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored.
arXiv Detail & Related papers (2025-06-16T13:24:50Z) - Generating Causal Explanations of Vehicular Agent Behavioural Interactions with Learnt Reward Profiles [13.450023647228843]
We learn a weighting of reward metrics for agents such that explanations for agent interactions can be causally inferred. We validate our approach quantitatively and qualitatively across three real-world driving datasets.
arXiv Detail & Related papers (2025-03-18T01:53:59Z) - Learning to Assist Humans without Inferring Rewards [65.28156318196397]
We build upon prior work that studies assistance through the lens of empowerment. An assistive agent aims to maximize the influence of the human's actions. We prove that these representations estimate a similar notion of empowerment to that studied by prior work.
arXiv Detail & Related papers (2024-11-04T21:31:04Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability [23.81322529587759]
REVEAL-IT is a novel framework for explaining the learning process of an agent in complex environments.
We visualize the policy structure and the agent's learning process for various training tasks.
A GNN-based explainer learns to highlight the most important sections of the policy, providing a clearer and more robust explanation of the agent's learning process.
arXiv Detail & Related papers (2024-06-20T11:29:26Z) - Learning to Generate Explainable Stock Predictions using Self-Reflective Large Language Models [54.21695754082441]
We propose a framework to teach Large Language Models (LLMs) to generate explainable stock predictions.
A reflective agent learns how to explain past stock movements through self-reasoning, while the PPO trainer trains the model to generate the most likely explanations.
Our framework can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient.
arXiv Detail & Related papers (2024-02-06T03:18:58Z) - Learning by Self-Explaining [23.420673675343266]
We introduce a novel workflow in the context of image classification, termed Learning by Self-Explaining (LSX).
LSX utilizes aspects of self-refining AI and human-guided explanatory machine learning.
Our results indicate improvements via Learning by Self-Explaining on several levels.
arXiv Detail & Related papers (2023-09-15T13:41:57Z) - Inverse Reinforcement Learning for Text Summarization [52.765898203824975]
We introduce inverse reinforcement learning (IRL) as an effective paradigm for training abstractive summarization models.
Experimental results across datasets in different domains demonstrate the superiority of our proposed IRL model for summarization over MLE and RL baselines.
arXiv Detail & Related papers (2022-12-19T23:45:05Z) - Complementary Explanations for Effective In-Context Learning [77.83124315634386]
Large language models (LLMs) have exhibited remarkable capabilities in learning from explanations in prompts.
This work aims to better understand the mechanisms by which explanations are used for in-context learning.
arXiv Detail & Related papers (2022-11-25T04:40:47Z) - Learning Rhetorical Structure Theory-based descriptions of observed behaviour [0.5249805590164901]
This paper proposes a new set of concepts, axiom schemata and algorithms that allow the agent to learn new descriptions of an observed behaviour.
The relations agents use to represent the descriptions they learn are inspired by Rhetorical Structure Theory (RST).
The paper reports results for these proposals in a demonstration scenario, using implemented software.
arXiv Detail & Related papers (2022-06-24T13:47:20Z) - Explaining, Evaluating and Enhancing Neural Networks' Learned Representations [2.1485350418225244]
We show how explainability can be an aid, rather than an obstacle, towards better and more efficient representations.
We employ such attributions to define two novel scores for evaluating the informativeness and the disentanglement of latent embeddings.
We show that adopting our proposed scores as constraints during the training of a representation learning task improves the downstream performance of the model.
arXiv Detail & Related papers (2022-02-18T19:00:01Z) - RELAX: Representation Learning Explainability [10.831313203043514]
We propose RELAX, which is the first approach for attribution-based explanations of representations.
RELAX explains representations by measuring similarities in the representation space between an input and masked-out versions of itself.
We provide theoretical interpretations of RELAX and conduct a novel analysis of feature extractors trained using supervised and unsupervised learning.
arXiv Detail & Related papers (2021-12-19T14:51:31Z) - Tell me why! -- Explanations support learning of relational and causal structure [24.434551113103105]
Explanations play a considerable role in human learning, especially in areas that remain major challenges for AI.
We show that reinforcement learning agents might likewise benefit from explanations.
Our results suggest that learning from explanations is a powerful principle that could offer a promising path towards training more robust and general machine learning systems.
arXiv Detail & Related papers (2021-12-07T15:09:06Z) - Active Inference in Robotics and Artificial Agents: Survey and Challenges [51.29077770446286]
We review the state-of-the-art theory and implementations of active inference for state-estimation, control, planning and learning.
We showcase relevant experiments that illustrate its potential in terms of adaptation, generalization and robustness.
arXiv Detail & Related papers (2021-12-03T12:10:26Z) - Collective eXplainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values [68.8204255655161]
This study proposes a novel approach to explain cooperative strategies in multiagent RL using Shapley values.
Results could have implications for non-discriminatory decision making, ethical and responsible AI-derived decisions or policy making under fairness constraints.
arXiv Detail & Related papers (2021-10-04T10:28:57Z) - Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z) - What can I do here? A Theory of Affordances in Reinforcement Learning [65.70524105802156]
We develop a theory of affordances for agents who learn and plan in Markov Decision Processes.
Affordances play a dual role in this case, by reducing the number of actions available in any given situation.
We propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.
arXiv Detail & Related papers (2020-06-26T16:34:53Z) - Weakly-Supervised Disentanglement Without Compromises [53.55580957483103]
Intelligent agents should be able to learn useful representations by observing changes in their environment.
We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation.
We show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations.
arXiv Detail & Related papers (2020-02-07T16:39:31Z)