To what extent do human explanations of model behavior align with actual
model behavior?
- URL: http://arxiv.org/abs/2012.13354v1
- Date: Thu, 24 Dec 2020 17:40:06 GMT
- Title: To what extent do human explanations of model behavior align with actual
model behavior?
- Authors: Grusha Prasad and Yixin Nie and Mohit Bansal and Robin Jia and Douwe
Kiela and Adina Williams
- Abstract summary: We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
- Score: 91.67905128825402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the increasingly prominent role NLP models (will) play in our lives, it
is important to evaluate models on their alignment with human expectations of
how models behave. Using Natural Language Inference (NLI) as a case study, we
investigated the extent to which human-generated explanations of models'
inference decisions align with how models actually make these decisions. More
specifically, we defined two alignment metrics that quantify how well natural
language human explanations align with model sensitivity to input words, as
measured by integrated gradients. Then, we evaluated six different transformer
models (the base and large versions of BERT, RoBERTa and ELECTRA), and found
that the BERT-base model has the highest alignment with human-generated
explanations, for both alignment metrics. Additionally, the base versions of
the models we surveyed tended to have higher alignment with human-generated
explanations than their larger counterparts, suggesting that increasing the
number of model parameters could result in worse alignment with human
explanations. Finally, we find that a model's alignment with human explanations
is not predicted by the model's accuracy on NLI, suggesting that accuracy and
alignment are orthogonal, and both are important ways to evaluate models.
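
As a concrete illustration of the setup in the abstract, the sketch below computes integrated-gradients attributions over the input word embeddings of an off-the-shelf NLI classifier and scores how much attribution mass lands on words mentioned in a human explanation. This is a minimal sketch under stated assumptions, not the authors' code: the checkpoint name, the pad-token baseline, the 50-step path approximation, and the simple overlap score are illustrative, and the paper's two alignment metrics are defined differently.

```python
# Minimal sketch: integrated gradients for an NLI model, plus a toy
# "alignment" score against words named in a human explanation.
# Checkpoint, baseline, and the overlap metric are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "textattack/bert-base-uncased-snli"   # assumed NLI checkpoint
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME).eval()

def integrated_gradients(premise, hypothesis, steps=50):
    enc = tok(premise, hypothesis, return_tensors="pt")
    ids, mask = enc["input_ids"], enc["attention_mask"]
    emb = model.get_input_embeddings()
    with torch.no_grad():
        x = emb(ids)                                     # (1, T, H) input word embeddings
        b = emb(torch.full_like(ids, tok.pad_token_id))  # pad-token baseline
        target = model(**enc).logits.argmax(-1).item()   # explain the predicted label

    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):        # Riemann sum along the straight-line path
        point = (b + alpha * (x - b)).requires_grad_(True)
        logit = model(inputs_embeds=point, attention_mask=mask,
                      token_type_ids=enc.get("token_type_ids")).logits[0, target]
        total_grad += torch.autograd.grad(logit, point)[0]

    token_scores = ((x - b) * total_grad / steps).sum(-1).squeeze(0)
    return tok.convert_ids_to_tokens(ids[0]), token_scores

def overlap_alignment(tokens, scores, explanation_words):
    """Share of absolute attribution mass on tokens the explanation mentions
    (a stand-in for the paper's two metrics, which are defined differently)."""
    mass = scores.abs()
    hit = sum(m for t, m in zip(tokens, mass) if t.lstrip("#") in explanation_words)
    return float(hit / mass.sum())

tokens, scores = integrated_gradients("A man is sleeping.", "A person is awake.")
print(overlap_alignment(tokens, scores, {"sleeping", "awake"}))
```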
Related papers
- Did the Models Understand Documents? Benchmarking Models for Language
Understanding in Document-Level Relation Extraction [2.4665182280122577]
Document-level relation extraction (DocRE) has attracted increasing research interest recently.
While models achieve consistent performance gains in DocRE, their underlying decision rules are still understudied.
In this paper, we take a first step toward answering this question and introduce a new perspective for comprehensively evaluating a model.
arXiv Detail & Related papers (2023-06-20T08:52:05Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- On the Lack of Robust Interpretability of Neural Text Classifiers [14.685352584216757]
We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
arXiv Detail & Related papers (2021-06-08T18:31:02Z)
- Explainable AI by BAPC -- Before and After correction Parameter Comparison [0.0]
A local surrogate for an AI model that corrects a simpler 'base' model is introduced, providing an analytical method for explaining AI predictions.
The AI model approximates the residual error of the linear base model, and explanations are formulated in terms of changes in the interpretable base model's parameters.
arXiv Detail & Related papers (2021-03-12T09:03:51Z)
- PSD2 Explainable AI Model for Credit Scoring [0.0]
The aim of this project is to develop and test advanced analytical methods to improve the prediction accuracy of Credit Risk Models.
The project focuses on applying an explainable machine learning model to bank-related databases.
arXiv Detail & Related papers (2020-11-20T12:12:38Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs. (A hedged sketch of the kNN retrieval step appears after this list.)
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage. (A toy sketch of the LAS idea also appears after this list.)
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
- Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction [49.254162397086006]
We study explanations based on visual saliency in an image-based age prediction task.
We find that presenting model predictions improves human accuracy.
However, explanations of various kinds fail to significantly alter human accuracy or trust in the model.
arXiv Detail & Related papers (2020-07-23T20:39:40Z)
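
Two of the related papers above describe mechanisms concrete enough to sketch. First, for "Explaining and Improving Model Behavior with k Nearest Neighbor Representations", the core retrieval step can be approximated as below: embed training and test examples with an encoder and return the k training examples whose representations are closest to a test example. This is a hedged sketch, not the paper's code; the plain bert-base-uncased checkpoint, [CLS] pooling, and cosine distance are illustrative assumptions (the paper works with task-finetuned models).

```python
# Hedged sketch of kNN-over-representations: retrieve the training examples
# whose encoder representations are closest to a new input.
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer

NAME = "bert-base-uncased"          # stand-in; the paper uses task-finetuned encoders
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME).eval()

def embed(texts):
    """[CLS] representation for each input text (illustrative pooling choice)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return enc(**batch).last_hidden_state[:, 0].numpy()

train_texts = ["the movie was wonderful", "a tedious, boring film",
               "great acting and a sharp script", "i want those two hours back"]
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embed(train_texts))

# Which training examples sit closest to a new prediction?
dist, idx = index.kneighbors(embed(["an absolutely delightful picture"]))
for d, i in zip(dist[0], idx[0]):
    print(f"{train_texts[i]!r}  (cosine distance {d:.3f})")
```

Retrieved neighbors that share a spurious cue rather than the relevant content are exactly the kind of learned association the paper reports the kNN view can surface.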
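
Second, for "Leakage-Adjusted Simulatability", the toy computation below captures the leakage-adjusted idea from the summary: credit explanations for helping a simulator predict the model's output, while macro-averaging over examples where the explanation alone already reveals that output ("leaking") and examples where it does not. The record fields and the macro-averaged gain are assumptions for illustration; the paper's exact estimator and simulator training differ in details.

```python
# Toy sketch of the leakage-adjusted simulatability (LAS) idea: explanations
# should help a simulator beyond what they leak about the label.
from statistics import mean

def las(records):
    """records: list of dicts with booleans
       sim_x_e - simulator correct given input + explanation
       sim_x   - simulator correct given input alone
       leaked  - simulator correct given explanation alone (proxy for leakage)"""
    def gain(subset):
        return mean(r["sim_x_e"] for r in subset) - mean(r["sim_x"] for r in subset)
    leaking = [r for r in records if r["leaked"]]
    nonleaking = [r for r in records if not r["leaked"]]
    # Macro-average the two subset gains so leaking examples cannot dominate.
    return 0.5 * (gain(leaking) + gain(nonleaking))

toy = [
    {"sim_x_e": True,  "sim_x": False, "leaked": True},
    {"sim_x_e": True,  "sim_x": True,  "leaked": True},
    {"sim_x_e": True,  "sim_x": False, "leaked": False},
    {"sim_x_e": False, "sim_x": False, "leaked": False},
]
print(las(toy))   # positive when explanations help beyond what they leak
```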
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.