Explaining Language Models' Predictions with High-Impact Concepts
- URL: http://arxiv.org/abs/2305.02160v1
- Date: Wed, 3 May 2023 14:48:27 GMT
- Title: Explaining Language Models' Predictions with High-Impact Concepts
- Authors: Ruochen Zhao, Shafiq Joty, Yongjie Wang, Tan Wang
- Abstract summary: We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
- Score: 11.47612457613113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The emergence of large-scale pretrained language models has posed
unprecedented challenges in deriving explanations of why the model has made
some predictions. Stemming from the compositional nature of languages, spurious
correlations have further undermined the trustworthiness of NLP systems,
leading to unreliable model explanations that are merely correlated with the
output predictions. To encourage fairness and transparency, there exists an
urgent demand for reliable explanations that allow users to consistently
understand the model's behavior. In this work, we propose a complete framework
for extending concept-based interpretability methods to NLP. Specifically, we
propose a post-hoc interpretability method for extracting predictive high-level
features (concepts) from the pretrained model's hidden layer activations. We
optimize for features whose existence causes the output predictions to change
substantially, i.e., features that generate a high impact. Moreover, we devise several
evaluation metrics that can be universally applied. Extensive experiments on
real and synthetic tasks demonstrate that our method achieves superior results
on predictive impact, usability, and faithfulness compared to the baselines.
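The abstract's notion of impact, a feature whose presence changes the output prediction substantially, can be illustrated with a toy ablation. The sketch below is not the authors' method: it assumes a concept is a single direction in a hidden-activation vector, that "removing" it means projecting that direction out, and that the model head is a small linear classifier. All names (`classify`, `remove_concept`, `concept_impact`) and the numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(hidden, weights):
    """Toy linear head (assumption): one weight row per output class."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)

def remove_concept(hidden, concept):
    """Project the concept direction out of the hidden activation."""
    norm_sq = sum(c * c for c in concept)
    coeff = sum(h * c for h, c in zip(hidden, concept)) / norm_sq
    return [h - coeff * c for h, c in zip(hidden, concept)]

def concept_impact(hidden, concept, weights):
    """Total absolute change in class probabilities after ablating the concept."""
    before = classify(hidden, weights)
    after = classify(remove_concept(hidden, concept), weights)
    return sum(abs(b - a) for b, a in zip(before, after))

# Hypothetical toy data: the head only reads the first hidden dimension.
hidden = [2.0, 0.5, -1.0]
weights = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
relevant = [1.0, 0.0, 0.0]    # direction the head depends on
irrelevant = [0.0, 1.0, 0.0]  # direction the head ignores

high = concept_impact(hidden, relevant, weights)
low = concept_impact(hidden, irrelevant, weights)
print(f"impact of relevant concept:   {high:.3f}")
print(f"impact of irrelevant concept: {low:.3f}")
```

In this toy setup, ablating the direction the head depends on moves the prediction to a uniform distribution, while ablating the ignored direction leaves it unchanged. A method that optimizes for high-impact concepts would favor features of the first kind.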
Related papers
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- Enhancing adversarial robustness in Natural Language Inference using explanations [41.46494686136601]
We cast the spotlight on the underexplored task of Natural Language Inference (NLI).
We validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation.
We research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models.
arXiv Detail & Related papers (2024-09-11T17:09:49Z)
- DEAL: Disentangle and Localize Concept-level Explanations for VLMs [10.397502254316645]
Large pre-trained Vision-Language Models might not be able to identify fine-grained concepts.
We propose to Disentangle and Localize (DEAL) concept-level explanations without human annotations.
Our empirical results demonstrate that the proposed method significantly improves the concept-level explanations of the model in terms of disentanglability and localizability.
arXiv Detail & Related papers (2024-07-19T15:39:19Z)
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading cause undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates the inconsistent behaviour issue by improving PLM awareness.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
- Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z)
- Token-wise Decomposition of Autoregressive Language Model Hidden States for Analyzing Model Predictions [9.909170013118775]
This work presents a linear decomposition of final hidden states from autoregressive language models based on each initial input token.
Using the change in next-word probability as a measure of importance, this work first examines which context words make the biggest contribution to language model predictions.
arXiv Detail & Related papers (2023-05-17T23:55:32Z)
- Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models with few examples show strong prediction bias across labels.
Although few-shot fine-tuning can mitigate the prediction bias, our analysis shows models gain performance improvement by capturing non-task-related features.
These observations warn that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.