Explanation Regularisation through the Lens of Attributions
- URL: http://arxiv.org/abs/2407.16693v1
- Date: Tue, 23 Jul 2024 17:56:32 GMT
- Title: Explanation Regularisation through the Lens of Attributions
- Authors: Pedro Ferreira, Wilker Aziz, Ivan Titov
- Abstract summary: Explanation regularisation (ER) has been introduced as a way to guide models to make their attributions "plausible".
This work contributes a study of ER's effectiveness at informing classification decisions on plausible tokens.
- Score: 30.68740512996253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explanation regularisation (ER) has been introduced as a way to guide models to make their predictions in a manner more akin to humans, i.e., making their attributions "plausible". This is achieved by introducing an auxiliary explanation loss, that measures how well the output of an input attribution technique for the model agrees with relevant human-annotated rationales. One positive outcome of using ER appears to be improved performance in out-of-domain (OOD) settings, presumably due to an increased reliance on "plausible" tokens. However, previous work has under-explored the impact of the ER objective on model attributions, in particular when obtained with techniques other than the one used to train ER. In this work, we contribute a study of ER's effectiveness at informing classification decisions on plausible tokens, and the relationship between increased plausibility and robustness to OOD conditions. Through a series of analyses, we find that the connection between ER and the ability of a classifier to rely on plausible features has been overstated and that a stronger reliance on plausible tokens does not seem to be the cause for any perceived OOD improvements.
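The auxiliary explanation loss described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes attributions are normalised into a distribution over tokens and uses a simple mean-squared-error distance to the (normalised) human rationale mask; ER work varies in both the attribution technique and the distance function used, and `normalise`, `explanation_loss`, and `er_objective` are hypothetical names introduced here.

```python
def normalise(scores):
    """Turn non-negative attribution scores into a distribution over tokens.
    (Assumption: real ER setups may use softmax or other normalisations.)"""
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else list(scores)

def explanation_loss(attributions, rationale_mask):
    """Mean squared error between normalised model attributions and the
    normalised human rationale mask (1 = token marked as evidence)."""
    a = normalise(attributions)
    r = normalise(rationale_mask)
    return sum((ai - ri) ** 2 for ai, ri in zip(a, r)) / len(a)

def er_objective(task_loss, attributions, rationale_mask, lam=1.0):
    """ER training objective: task loss plus a weighted explanation loss."""
    return task_loss + lam * explanation_loss(attributions, rationale_mask)
```

When the model's attributions concentrate exactly on the annotated rationale tokens, the explanation loss vanishes and the objective reduces to the task loss; attributions on non-rationale tokens are penalised in proportion to `lam`.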
Related papers
- Identifiable Latent Neural Causal Models [82.14087963690561]
Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data.
We determine the types of distribution shifts that do contribute to the identifiability of causal representations.
We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations.
arXiv Detail & Related papers (2024-03-23T04:13:55Z)
- REFER: An End-to-end Rationale Extraction Framework for Explanation Regularization [12.409398096527829]
We propose REFER, a framework that employs a differentiable rationale extractor, allowing back-propagation through the rationale extraction process.
We analyze the impact of using human highlights during training by jointly training the task model and the rationale extractor.
arXiv Detail & Related papers (2023-10-22T21:20:52Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
- Why Does Little Robustness Help? Understanding and Improving Adversarial Transferability from Surrogate Training [24.376314203167016]
Adversarial examples (AEs) for DNNs have been shown to be transferable.
In this paper, we take a further step towards understanding adversarial transferability.
arXiv Detail & Related papers (2023-07-15T19:20:49Z)
- Be Your Own Neighborhood: Detecting Adversarial Example by the Neighborhood Relations Built on Self-Supervised Learning [64.78972193105443]
This paper presents a novel AE detection framework for trustworthy predictions.
It performs detection by distinguishing an AE's abnormal relations with its augmented versions.
An off-the-shelf Self-Supervised Learning (SSL) model is used to extract the representation and predict the label.
arXiv Detail & Related papers (2022-08-31T08:18:44Z)
- Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection [21.341303776931532]
We first show that the expected Kendall's rank correlation is positively correlated with cosine similarity, and then indicate that the direction of attribution is the key to attribution robustness.
Our analysis further exposes that IGR encourages neurons with the same activation states for natural samples and the corresponding perturbed samples, which is shown to induce robustness to gradient-based attribution methods.
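The relationship the summary above refers to can be illustrated numerically. This is a hedged sketch, not the paper's analysis: it compares Kendall's rank correlation and cosine similarity between a hypothetical attribution map and two perturbed versions of it, one preserving the ranking and one reversing the direction.

```python
from itertools import combinations
from math import sqrt

def kendall_tau(a, b):
    """Kendall's rank correlation: (concordant - discordant) pairs,
    divided by the total number of pairs (no tie correction)."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def cosine(a, b):
    """Cosine similarity between two attribution vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

attr = [0.9, 0.7, 0.5, 0.3, 0.1]         # attribution map for a clean input
same_dir = [0.8, 0.6, 0.45, 0.25, 0.05]  # perturbed map, ranking preserved
flipped = list(reversed(attr))           # perturbed map, direction reversed
```

Here `kendall_tau(attr, same_dir)` is 1.0 with cosine similarity near 1, while `kendall_tau(attr, flipped)` is -1.0 with a markedly lower cosine similarity, consistent with the claim that rank agreement and cosine similarity move together.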
arXiv Detail & Related papers (2022-05-15T13:08:50Z)
- Effective Explanations for Entity Resolution Models [21.518135952436975]
We study the fundamental problem of explainability of the deep learning solution for ER.
We propose the CERTA approach that is aware of the semantics of the ER problem.
We experimentally evaluate CERTA's explanations of state-of-the-art ER solutions based on DL models using publicly available datasets.
arXiv Detail & Related papers (2022-03-24T10:50:05Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
- Latent Causal Invariant Model [128.7508609492542]
Current supervised learning can pick up spurious correlations during the data-fitting process.
We propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction.
arXiv Detail & Related papers (2020-11-04T10:00:27Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.