Explanation Design in Strategic Learning: Sufficient Explanations that Induce Non-harmful Responses
- URL: http://arxiv.org/abs/2502.04058v2
- Date: Wed, 28 May 2025 14:05:11 GMT
- Title: Explanation Design in Strategic Learning: Sufficient Explanations that Induce Non-harmful Responses
- Authors: Kiet Q. H. Vo, Siu Lun Chau, Masahiro Kato, Yixin Wang, Krikamol Muandet
- Abstract summary: A key open question is how DMs can communicate explanations in a way that avoids harming strategic agents. We prove that action recommendation-based explanations (ARexes) are sufficient for non-harmful responses. Experiments show that ARexes allow the DM to optimise their model's predictive performance while preserving agents' utility.
- Score: 29.57116418734347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study explanation design in algorithmic decision making with strategic agents, individuals who may modify their inputs in response to explanations of a decision maker's (DM's) predictive model. As the demand for transparent algorithmic systems continues to grow, most prior work assumes full model disclosure as the default solution. In practice, however, DMs such as financial institutions typically disclose only partial model information via explanations. Such partial disclosure can lead agents to misinterpret the model and take actions that unknowingly harm their utility. A key open question is how DMs can communicate explanations in a way that avoids harming strategic agents, while still supporting their own decision-making goals, e.g., minimising predictive error. In this work, we analyse well-known explanation methods, and establish a necessary condition to prevent explanations from misleading agents into self-harming actions. Moreover, with a conditional homogeneity assumption, we prove that action recommendation-based explanations (ARexes) are sufficient for non-harmful responses, mirroring the revelation principle in information design. To demonstrate how ARexes can be operationalised in practice, we propose a simple learning procedure that jointly optimises the predictive model and explanation policy. Experiments on synthetic and real-world tasks show that ARexes allow the DM to optimise their model's predictive performance while preserving agents' utility, offering a more refined strategy for safe and effective partial model disclosure.
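The joint learning procedure mentioned in the abstract is not specified on this page, but a minimal sketch can convey the general idea: the DM searches jointly over a predictive model and an action-recommendation policy, agents adopt a recommended action only when it does not reduce their utility, and the DM scores its model on the post-response features. Everything below, including the linear model, the quadratic effort cost, the shared recommendation, and the random-search optimisation, is a hypothetical illustration under assumed names, not the paper's actual formulation.

```python
# Hedged sketch: jointly choosing a predictive model and an action
# recommendation (ARex-style) so that agents who follow the recommendation
# are never worse off, while the DM minimises post-response predictive error.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: features x, outcomes assumed to depend causally on x.
n, d = 200, 3
w_true = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(n, d))

def agent_utility(x_new, x_old, w_model):
    # Assumed agent utility: gain in the DM's predicted score minus a
    # quadratic effort cost for changing features.
    gain = (x_new - x_old) @ w_model
    effort = 0.5 * np.sum((x_new - x_old) ** 2)
    return gain - effort

def respond(X, w_model, rec_direction, rec_step):
    # Every agent receives the same recommended action (a simplification);
    # it is applied only when it is non-harmful for that agent.
    X_out = X.copy()
    for i in range(len(X)):
        x_new = X[i] + rec_step * rec_direction
        if agent_utility(x_new, X[i], w_model) >= 0.0:
            X_out[i] = x_new
    return X_out

def dm_loss(w_model, rec_direction, rec_step):
    # DM's objective: predictive error measured after agents respond
    # (noiseless outcomes for simplicity).
    X_post = respond(X, w_model, rec_direction, rec_step)
    y_post = X_post @ w_true
    return np.mean((X_post @ w_model - y_post) ** 2)

# Crude joint random search over the model and the recommendation policy;
# a real procedure would presumably be gradient-based.
best_loss, best_params = np.inf, None
for _ in range(500):
    w = rng.normal(size=d)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    step = rng.uniform(0.0, 1.0)
    loss = dm_loss(w, direction, step)
    if loss < best_loss:
        best_loss, best_params = loss, (w, direction, step)

print(f"best post-response MSE: {best_loss:.4f}")
```

The per-agent non-harm check inside respond stands in for the requirement that recommended actions never reduce an agent's utility; the paper presumably establishes this property at the level of the explanation policy rather than enforcing it per agent at run time, as is done in this toy.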
Related papers
- Interpreting Emergent Planning in Model-Free Reinforcement Learning [13.820891288919002]
We present the first evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban.
arXiv Detail & Related papers (2025-04-02T16:24:23Z) - On Generating Monolithic and Model Reconciling Explanations in Probabilistic Scenarios [46.752418052725126]
We propose a novel framework for generating probabilistic monolithic explanations and model reconciling explanations.
For monolithic explanations, our approach integrates uncertainty by utilizing probabilistic logic to increase the probability of the explanandum.
For model reconciling explanations, we propose a framework that extends the logic-based variant of the model reconciliation problem to account for probabilistic human models.
arXiv Detail & Related papers (2024-05-29T16:07:31Z) - Robust Explainable Recommendation [10.186029242664931]
We present a general framework for feature-aware explainable recommenders that can withstand external attacks.
Our framework is simple to implement and supports different methods regardless of the internal model structure and intrinsic utility within any model.
arXiv Detail & Related papers (2024-05-03T05:03:07Z) - Personalized Decision Supports based on Theory of Mind Modeling and Explainable Reinforcement Learning [0.9071985476473737]
We propose a novel personalized decision support system that combines Theory of Mind (ToM) modeling and explainable Reinforcement Learning (XRL).
Our proposed system generates accurate and personalized interventions that are easily interpretable by end-users.
arXiv Detail & Related papers (2023-12-13T00:37:17Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness of saliency-based explanations and their potential for misunderstanding.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Understanding the (Extra-)Ordinary: Validating Deep Model Decisions with Prototypical Concept-based Explanations [13.60538902487872]
We present a novel post-hoc concept-based XAI framework that conveys besides instance-wise (local) also class-wise (global) decision-making strategies via prototypes.
We demonstrate the effectiveness of our approach in identifying out-of-distribution samples, spurious model behavior and data quality issues across three datasets.
arXiv Detail & Related papers (2023-11-28T10:53:26Z) - Learning Transferable Conceptual Prototypes for Interpretable Unsupervised Domain Adaptation [79.22678026708134]
In this paper, we propose an inherently interpretable method, named Transferable Prototype Learning (TCPL).
To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process.
Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-art methods.
arXiv Detail & Related papers (2023-10-12T06:36:41Z) - Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z) - Explainable Data-Driven Optimization: From Context to Decision and Back Again [76.84947521482631]
Data-driven optimization uses contextual information and machine learning algorithms to find solutions to decision problems with uncertain parameters.
We introduce a counterfactual explanation methodology tailored to explain solutions to data-driven problems.
We demonstrate our approach by explaining key problems in operations management such as inventory management and routing.
arXiv Detail & Related papers (2023-01-24T15:25:16Z) - Explainability in Process Outcome Prediction: Guidelines to Obtain Interpretable and Faithful Models [77.34726150561087]
We define explainability through the interpretability of the explanations and the faithfulness of the explainability model in the field of process outcome prediction.
This paper contributes a set of guidelines named X-MOP which allows selecting the appropriate model based on the event log specifications.
arXiv Detail & Related papers (2022-03-30T05:59:50Z) - Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies [79.60322329952453]
We show how to develop interpretable representations of how agents make decisions.
By understanding the decision-making processes underlying a set of observed trajectories, we cast the policy inference problem as the inverse of an online learning problem.
We introduce a practical algorithm for retrospectively estimating such perceived effects, alongside the process through which agents update them.
Through application to the analysis of UNOS organ donation acceptance decisions, we demonstrate that our approach can bring valuable insights into the factors that govern decision processes and how they change over time.
arXiv Detail & Related papers (2022-03-14T17:40:42Z) - Feature Attributions and Counterfactual Explanations Can Be Manipulated [32.579094387004346]
We show how adversaries can design biased models that manipulate model agnostic feature attribution methods.
These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model.
We evaluate the manipulations on real world data sets, including COMPAS and Communities & Crime, and find explanations can be manipulated in practice.
arXiv Detail & Related papers (2021-06-23T17:43:31Z) - Insights into Data through Model Behaviour: An Explainability-driven Strategy for Data Auditing for Responsible Computer Vision Applications [70.92379567261304]
This study explores an explainability-driven strategy for data auditing.
We demonstrate this strategy by auditing two popular medical benchmark datasets.
We discover hidden data quality issues that lead deep learning models to make predictions for the wrong reasons.
arXiv Detail & Related papers (2021-06-16T23:46:39Z) - Feature-Based Interpretable Reinforcement Learning based on State-Transition Models [3.883460584034766]
Growing concerns regarding the operational use of AI models in the real world have caused a surge of interest in explaining AI models' decisions to humans.
We propose a method for offering local explanations on risk in reinforcement learning.
arXiv Detail & Related papers (2021-05-14T23:43:11Z) - Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-03-18T12:57:34Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses [14.626432428431594]
We propose a novel model framework called Actionable Recourse agnostic (AReS) to construct global counterfactual explanations.
We formulate a novel objective which simultaneously optimizes for correctness of the recourses and interpretability of the explanations.
Our framework can provide decision makers with a comprehensive overview of recourses corresponding to any black box model.
arXiv Detail & Related papers (2020-09-15T15:14:08Z) - Decisions, Counterfactual Explanations and Strategic Behavior [16.980621769406923]
We find policies and counterfactual explanations that are optimal in terms of utility in a strategic setting.
We show that, given a pre-defined policy, the problem of finding the optimal set of counterfactual explanations is NP-hard.
We demonstrate that, by incorporating a matroid constraint into the problem formulation, we can increase the diversity of the optimal set of counterfactual explanations.
arXiv Detail & Related papers (2020-02-11T12:04:41Z)
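The last entry above is closely related to the strategic setting of the main paper. As a loose companion, the sketch below greedily selects a small set of counterfactual explanations under a cardinality budget (the simplest matroid constraint) so that as many rejected individuals as possible are helped by at least one of them. The linear threshold policy, the candidate explanations, and the coverage objective are all illustrative assumptions, not the cited paper's formulation or algorithm.

```python
# Hedged toy: greedy selection of a diverse, small set of counterfactual
# explanations for a fixed decision policy, under a cardinality constraint.
import numpy as np

rng = np.random.default_rng(1)

# A fixed linear threshold policy f(x) = 1[<w, x> > 0] (an assumption).
w = np.array([1.0, 1.0])
X = rng.normal(loc=-1.0, size=(100, 2))
rejected = X[X @ w <= 0.0]

# Candidate counterfactual explanations: additive feature changes an agent
# could be told to make (hypothetical values).
candidates = [np.array(a) for a in
              [(2.0, 0.0), (0.0, 2.0), (1.5, 1.5), (3.0, 0.5), (0.5, 3.0)]]

def coverage(selected):
    # Number of rejected individuals for whom at least one selected
    # explanation flips the decision to positive.
    if not selected:
        return 0
    helped = np.zeros(len(rejected), dtype=bool)
    for a in selected:
        helped |= (rejected + a) @ w > 0.0
    return int(helped.sum())

# Greedy selection of k explanations (a cardinality matroid); coverage is
# monotone submodular, so greedy carries the usual approximation guarantee.
k, selected = 2, []
for _ in range(k):
    gains = [coverage(selected + [a]) - coverage(selected) for a in candidates]
    selected.append(candidates[int(np.argmax(gains))])

print("rejected individuals covered by", k, "explanations:", coverage(selected))
```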
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.