Counterfactual Training: Teaching Models Plausible and Actionable Explanations
- URL: http://arxiv.org/abs/2601.16205v1
- Date: Thu, 22 Jan 2026 18:56:14 GMT
- Authors: Patrick Altmeyer, Aleksander Buszydlik, Arie van Deursen, Cynthia C. S. Liem
- Abstract summary: We propose a novel training regime termed counterfactual training to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
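The abstract's two desiderata, plausibility (counterfactuals stay close to the data) and actionability (only mutable features change), can be illustrated with a minimal gradient-based counterfactual search in the style of Wachter et al. The model, weights, and constants below are illustrative assumptions, not the paper's implementation.

```python
import math

# Fixed logistic model with assumed weights; stands in for any
# differentiable classifier.
W = [1.5, -2.0]
B = 0.25

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def counterfactual(x, target=1.0, lam=0.1, lr=0.05, steps=500,
                   mutable=(True, True)):
    """Search for x' near x with predict(x') close to target.

    Minimizes (p - target)^2 + lam * ||x' - x||^2 by gradient descent;
    the proximity term is a crude stand-in for plausibility, and the
    `mutable` mask encodes a simple actionability constraint: frozen
    features are never perturbed.
    """
    xc = list(x)
    for _ in range(steps):
        p = predict(xc)
        # d/dz of (p - target)^2 via the sigmoid derivative p(1 - p)
        dp = 2.0 * (p - target) * p * (1.0 - p)
        for i in range(len(xc)):
            if not mutable[i]:
                continue
            g = dp * W[i] + 2.0 * lam * (xc[i] - x[i])
            xc[i] -= lr * g
    return xc

x = [0.0, 1.0]          # factual input, classified toward class 0
xc = counterfactual(x)  # counterfactual pushed across the boundary
```

Freezing a feature via `mutable` mimics an actionability constraint: an immutable feature (e.g. age in a credit-scoring setting) is never perturbed, so the returned counterfactual offers recourse only through features the individual can actually change.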
Related papers
- LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching [8.220601095681355]
We propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. It can handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators.
arXiv Detail & Related papers (2025-10-16T12:34:10Z)
- Enhancing XAI Narratives through Multi-Narrative Refinement and Knowledge Distillation [13.523610021268363]
Counterfactual explanations offer insights into model behavior by highlighting minimal changes that would alter a prediction. Despite their potential, these explanations are often complex and technical, making them difficult for non-experts to interpret. We propose a novel pipeline that leverages Language Models, large and small, to compose narratives for counterfactual explanations.
arXiv Detail & Related papers (2025-10-03T16:04:09Z)
- How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations [69.72654127617058]
Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs). In this work, we bring forward empirical evidence that challenges this very notion. We discover a strong dependency on the training details of a pre-trained model's classification layer and demonstrate that they play a crucial role.
arXiv Detail & Related papers (2025-03-01T22:25:11Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations [64.85696493596821]
In computer vision applications, generative counterfactual methods indicate how to perturb a model's input to change its prediction.
We propose a counterfactual method that learns a perturbation in a disentangled latent space that is constrained using a diversity-enforcing loss.
Our model improves the success rate of producing high-quality valuable explanations when compared to previous state-of-the-art methods.
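The diversity-enforcing idea above can be sketched as a pairwise-repulsion term over a batch of candidate perturbations: similar perturbations are penalized, spread-out ones are not. The exact loss used by the paper is not reproduced here; the function below is an illustrative assumption only.

```python
import math

def diversity_loss(perturbations):
    """Penalize near-duplicate perturbations.

    Sums exp(-||z_i - z_j||) over all pairs, so identical perturbations
    contribute 1.0 per pair while well-separated ones contribute ~0.
    """
    loss = 0.0
    n = len(perturbations)
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(perturbations[i],
                                                 perturbations[j])))
            loss += math.exp(-dist)
    return loss

identical = [[0.5, 0.5], [0.5, 0.5]]  # maximally penalized pair
spread = [[0.0, 0.0], [3.0, 3.0]]     # nearly no penalty
```

Minimizing such a term alongside the usual counterfactual objective pushes the method toward a set of distinct explanations rather than many copies of one.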
arXiv Detail & Related papers (2021-03-18T12:57:34Z)
- Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge [96.92252296244233]
Large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control.
We show that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements.
Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.
arXiv Detail & Related papers (2020-06-11T17:02:20Z)
- FairALM: Augmented Lagrangian Method for Training Fair Models with Little Regret [42.66567001275493]
It is now accepted that, because of biases in the datasets presented to models, fairness-oblivious training leads to unfair models.
Here, we study mechanisms that impose fairness concurrently while training the model.
arXiv Detail & Related papers (2020-04-03T03:18:53Z)
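The augmented Lagrangian machinery behind FairALM can be illustrated on a toy scalar problem. The fairness constraint is abstracted into a generic equality constraint `c`, and all names and constants below are illustrative assumptions, not the paper's method.

```python
def augmented_lagrangian_min(f_grad, c, c_grad, x=0.0, lam=0.0, rho=10.0,
                             outer=20, inner=200, lr=0.01):
    """Minimize f(x) subject to c(x) = 0 via augmented Lagrangian steps.

    Inner loop: gradient descent on f(x) + lam*c(x) + (rho/2)*c(x)^2.
    Outer loop: dual update lam <- lam + rho * c(x).
    """
    for _ in range(outer):
        for _ in range(inner):
            g = f_grad(x) + (lam + rho * c(x)) * c_grad(x)
            x -= lr * g
        lam += rho * c(x)  # tighten the constraint penalty
    return x, lam

# Toy problem: minimize (x - 2)^2 subject to x - 1 = 0.
# The constrained optimum is x = 1 with multiplier lam = 2.
x_opt, lam_opt = augmented_lagrangian_min(
    f_grad=lambda x: 2.0 * (x - 2.0),
    c=lambda x: x - 1.0,
    c_grad=lambda x: 1.0,
)
```

In a fairness setting, `c` would instead measure a group disparity (e.g. a difference in positive-prediction rates), so the same dual updates progressively enforce fairness while the primal steps train the model.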
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.