Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates
- URL: http://arxiv.org/abs/2110.03212v1
- Date: Thu, 7 Oct 2021 06:59:46 GMT
- Title: Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates
- Authors: Xiaochuang Han, Yulia Tsvetkov
- Abstract summary: We show that in a controlled setup, influence tuning can help deconfound the model from spurious patterns in data.
- Score: 26.527311287924995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Among the most critical limitations of deep learning NLP models are their lack of interpretability and their reliance on spurious correlations. Prior work proposed various approaches to interpreting black-box models in order to unveil spurious correlations, but these interpretations were used primarily in human-computer interaction scenarios. It remains underexplored whether and how such model interpretations can be used to automatically "unlearn" confounding features. In this work, we propose influence tuning, a procedure that leverages model interpretations to update the model parameters towards a plausible interpretation (rather than an interpretation that relies on spurious patterns in the data), in addition to learning to predict the task labels. We show that in a controlled setup, influence tuning can help deconfound the model from spurious patterns in the data, significantly outperforming baseline methods that use adversarial training.
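Below is a minimal PyTorch-style sketch of the general idea, not the authors' implementation: alongside the task loss, a first-order (TracIn-style, gradient dot-product) influence estimate between pairs of examples that share only a confounding feature is penalized toward zero. The `loss_fn(model, batch)` signature, the choice of influence approximation, and the squared penalty are illustrative assumptions.

```python
import torch

def tracin_influence(model, loss_fn, batch_a, batch_b):
    # First-order (TracIn-style) influence estimate: the dot product of the
    # parameter gradients of the two batches' losses.
    params = [p for p in model.parameters() if p.requires_grad]
    grads_a = torch.autograd.grad(loss_fn(model, batch_a), params, create_graph=True)
    grads_b = torch.autograd.grad(loss_fn(model, batch_b), params, create_graph=True)
    return sum((ga * gb).sum() for ga, gb in zip(grads_a, grads_b))

def influence_tuning_step(model, optimizer, loss_fn, task_batch, confound_pairs, lam=1.0):
    # One update that (i) learns to predict the task labels and (ii) drives the
    # estimated influence between confound-sharing example pairs toward zero.
    optimizer.zero_grad()
    task_loss = loss_fn(model, task_batch)
    attr_loss = sum(tracin_influence(model, loss_fn, a, b) ** 2
                    for a, b in confound_pairs)
    (task_loss + lam * attr_loss).backward()
    optimizer.step()
    return float(task_loss)
```

In practice one would sample such confound-sharing pairs from the training data and balance the penalty weight `lam` against the task loss; the paper's actual attribution method and update schedule may differ from this sketch.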
Related papers
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
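A rough, hypothetical sketch of the data-reweighting idea mentioned in the entry above: per-example weights are optimized so that the weighted covariance between each binary lexical feature and the label shrinks. The statistic, optimizer, and constraints used in the paper are not reproduced here; this only illustrates the general mechanism.

```python
import torch

def debias_reweight(features, labels, steps=500, lr=0.1):
    # features: (n, d) 0/1 lexical-feature matrix; labels: (n,) binary labels.
    # Learn per-example weights (mean 1) that shrink each feature's weighted
    # covariance with the label, a proxy for removing feature-label correlations.
    n = features.shape[0]
    logits = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    x, y = features.float(), labels.float()
    for _ in range(steps):
        w = torch.softmax(logits, dim=0) * n              # weights summing to n
        mean_y = (w * y).mean()
        mean_x = (w.unsqueeze(1) * x).mean(dim=0)
        mean_xy = ((w * y).unsqueeze(1) * x).mean(dim=0)
        loss = ((mean_xy - mean_x * mean_y) ** 2).sum()   # squared weighted covariances
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.softmax(logits, dim=0) * n).detach()
```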
- Causal Analysis for Robust Interpretability of Neural Networks [0.2519906683279152]
We develop a robust intervention-based method to capture cause-effect mechanisms in pre-trained neural networks.
We apply our method to vision models trained on classification tasks.
arXiv Detail & Related papers (2023-05-15T18:37:24Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amount of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- A Detailed Study of Interpretability of Deep Neural Network based Top Taggers [3.8541104292281805]
Recent developments in explainable AI (XAI) allow researchers to explore the inner workings of deep neural networks (DNNs).
We explore the interpretability of models designed to identify jets coming from top quark decay in high-energy proton-proton collisions at the Large Hadron Collider (LHC).
Our studies uncover some major pitfalls of existing XAI methods and illustrate how they can be overcome to obtain consistent and meaningful interpretation of these models.
arXiv Detail & Related papers (2022-10-09T23:02:42Z)
- How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and autoencoder-based (AE) models.
We develop a novel evaluation scheme in which the linear head is trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from potential bias of the linear head used for evaluation.
arXiv Detail & Related papers (2022-06-17T16:18:28Z)
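A small sketch of the evaluation scheme described in the entry above, assuming the frozen pre-trained encoder is exposed as an `encode` function that returns feature arrays; the scikit-learn logistic-regression probe and accuracy metric are stand-ins, not necessarily what the paper uses.

```python
from sklearn.linear_model import LogisticRegression

def linear_head_eval(encode, ood_inputs, ood_labels, test_inputs, test_labels):
    # Fit the linear head on frozen-encoder features of out-of-distribution data,
    # then score it on the evaluation split, so that the result reflects the
    # representation rather than a head biased toward the evaluation distribution.
    head = LogisticRegression(max_iter=1000)
    head.fit(encode(ood_inputs), ood_labels)
    return head.score(encode(test_inputs), test_labels)
```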
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study in which participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Correlation inference attacks against machine learning models [6.805105137455252]
We explore correlation inference attacks, which ask whether and when a model leaks information about the correlations between its input variables.
Our results raise fundamental questions on what a model does and should remember from its training set.
arXiv Detail & Related papers (2021-12-16T11:42:45Z)
- Refining Neural Networks with Compositional Explanations [31.84868477264624]
We propose to refine a learned model by collecting human-provided compositional explanations on the model's failure cases.
We demonstrate the effectiveness of the proposed approach on two text classification tasks.
arXiv Detail & Related papers (2021-03-18T17:48:54Z)
- Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [0.0]
In RNNs, encoding information in a suboptimal way can impact the quality of representations based on later elements in the sequence.
I propose an augmentation to standard RNNs in the form of a gradient-based correction mechanism.
I conduct different experiments in the context of language modeling, where the impact of using such a mechanism is examined in detail.
arXiv Detail & Related papers (2021-01-03T17:54:17Z)
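A hypothetical sketch of one gradient-based correction ("recoding") step for the entry above, using predictive entropy as a stand-in error signal; the actual signal, step size, and update rule in the thesis may differ.

```python
import torch
import torch.nn.functional as F

def recode_hidden(hidden, output_layer, step_size=0.1):
    # Nudge the current hidden state against the gradient of an error signal
    # (here: entropy of the next-token distribution) before it is carried
    # forward to later elements of the sequence.
    h = hidden.detach().requires_grad_(True)
    probs = F.softmax(output_layer(h), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    (grad,) = torch.autograd.grad(entropy, h)
    return (h - step_size * grad).detach()
```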
- Learning from others' mistakes: Avoiding dataset biases without modeling them [111.17078939377313]
State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface-form correlations instead of features that target the intended task.
Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available.
We show a method for training models that learn to ignore these problematic correlations.
arXiv Detail & Related papers (2020-12-02T16:10:54Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
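As a rough illustration of the retrieval step behind the kNN approach in the entry above (not the paper's exact pipeline): find the k training examples whose encoder representations are closest to a test example's, so their texts and labels can be inspected for spurious associations.

```python
import numpy as np

def knn_attribution(train_reps, train_examples, test_rep, k=5):
    # Cosine similarity between the test representation and every training
    # representation; the top-k neighbours are candidate "responsible" examples.
    train_norm = train_reps / np.linalg.norm(train_reps, axis=1, keepdims=True)
    test_norm = test_rep / np.linalg.norm(test_rep)
    sims = train_norm @ test_norm
    top = np.argsort(-sims)[:k]
    return [(train_examples[i], float(sims[i])) for i in top]
```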