Intervention Lens: from Representation Surgery to String Counterfactuals
- URL: http://arxiv.org/abs/2402.11355v4
- Date: Sun, 20 Oct 2024 20:18:06 GMT
- Title: Intervention Lens: from Representation Surgery to String Counterfactuals
- Authors: Matan Avitan, Ryan Cotterell, Yoav Goldberg, Shauli Ravfogel
- Abstract summary: Interventions targeting the representation space of language models (LMs) have emerged as an effective means to influence model behavior.
We give a method to convert representation counterfactuals into string counterfactuals.
The resulting counterfactuals can be used to mitigate bias in classification through data augmentation.
- Abstract: Interventions targeting the representation space of language models (LMs) have emerged as an effective means to influence model behavior. Such methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations and, in so doing, create a counterfactual representation. However, because the intervention operates within the representation space, understanding precisely what aspects of the text it modifies poses a challenge. In this paper, we give a method to convert representation counterfactuals into string counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation space intervention and to interpret the features utilized to encode a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification through data augmentation.
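The representation-space intervention described in the abstract can be illustrated with a minimal linear-erasure sketch. This is a generic illustration, not the paper's actual method; the vectors, the dimensionality, and the `erase_concept` helper are all hypothetical:

```python
import numpy as np

def erase_concept(h, v):
    """Return a counterfactual representation: h with its component
    along the concept direction v (e.g., a learned 'gender' axis)
    projected out."""
    v_hat = v / np.linalg.norm(v)
    return h - np.dot(h, v_hat) * v_hat

rng = np.random.default_rng(0)
h = rng.normal(size=768)   # hypothetical LM hidden state
v = rng.normal(size=768)   # hypothetical concept direction
h_cf = erase_concept(h, v)

# The counterfactual carries no component along the concept direction.
print(abs(np.dot(h_cf, v / np.linalg.norm(v))) < 1e-8)  # True
```

The paper's contribution runs in the opposite direction: mapping such an edited representation back to a concrete string, so that the textual effect of the intervention becomes visible. The sketch only shows the representation-space side.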
Related papers
- Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z)
- Representations as Language: An Information-Theoretic Framework for Interpretability [7.2129390689756185]
Large scale neural models show impressive performance across a wide array of linguistic tasks.
Despite this they remain, largely, black-boxes, inducing vector-representations of their input that prove difficult to interpret.
We introduce a novel approach to interpretability that looks at the mapping a model learns from sentences to representations as a kind of language in its own right.
arXiv Detail & Related papers (2024-06-04T16:14:00Z)
- Latent Space Explanation by Intervention [16.43087660376697]
This study aims to reveal hidden concepts by employing an intervention mechanism that shifts the predicted class based on discrete variational autoencoders.
An explanatory model then visualizes encoded information from any hidden layer and its corresponding intervened representation.
arXiv Detail & Related papers (2021-12-09T13:23:19Z)
- Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction [61.4913233397155]
We show that BERT uses information about RC spans during agreement prediction, following a linguistically plausible strategy.
We also found that counterfactual representations generated for a specific RC subtype influenced the number prediction in sentences with other RC subtypes, suggesting that information about RC boundaries was encoded abstractly in BERT's representation.
arXiv Detail & Related papers (2021-05-14T17:11:55Z)
- "Let's Eat Grandma": When Punctuation Matters in Sentence Representation for Sentiment Analysis [13.873803872380229]
We argue that punctuation could play a significant role in sentiment analysis and propose a novel representation model to improve syntactic and contextual performance.
We conduct experiments on publicly available datasets and verify that our model identifies sentiment more accurately than other state-of-the-art baseline methods.
arXiv Detail & Related papers (2020-12-10T19:07:31Z)
- Unsupervised Distillation of Syntactic Information from Contextualized Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z)
- Assessing Phrasal Representation and Composition in Transformers [13.460125148455143]
Deep transformer models have pushed performance on NLP tasks to new limits.
We present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers.
We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition.
arXiv Detail & Related papers (2020-10-08T04:59:39Z)
- Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions [55.660255727031725]
Influence functions explain the decisions of a model by identifying influential training examples.
We conduct a comparison between influence functions and common word-saliency methods on representative tasks.
We develop a new measure based on influence functions that can reveal artifacts in training data.
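The influence-function idea summarized above can be sketched in first-order form. This is a simplified illustration with the Hessian replaced by the identity, on a toy logistic model; the helpers and values are hypothetical, not the cited paper's implementation:

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss at a single example (x, y)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * x

def influence(w, z_train, z_test):
    """First-order influence of up-weighting a training example on a
    test loss, with the Hessian approximated by the identity:
    I ~ -grad(test) . grad(train). A negative score means the training
    example helps (up-weighting it lowers the test loss)."""
    return -np.dot(grad_logistic(w, *z_test), grad_logistic(w, *z_train))

w = np.array([0.5, -0.2])
z_test = (np.array([1.0, 0.0]), 1)
helpful = influence(w, (np.array([1.0, 0.0]), 1), z_test)  # consistent label
harmful = influence(w, (np.array([1.0, 0.0]), 0), z_test)  # flipped label
print(helpful < 0 < harmful)  # True
```

Ranking training examples by such scores is what lets influence-based measures surface mislabeled or artifact-bearing training data.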
arXiv Detail & Related papers (2020-05-14T00:45:23Z)
- Analysing Lexical Semantic Change with Contextualised Word Representations [7.071298726856781]
We propose a novel method that exploits the BERT neural language model to obtain representations of word usages.
We create a new evaluation dataset and show that the model representations and the detected semantic shifts are positively correlated with human judgements.
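A generic way to turn contextual usage representations into a change score, in the spirit of the summary above, is the cosine distance between a word's mean usage vectors in two time periods. This is only an illustrative measure with synthetic vectors, not the paper's exact metric:

```python
import numpy as np

def semantic_shift(usages_t1, usages_t2):
    """Cosine distance between a word's average contextual usage
    vectors in two time periods; larger values suggest more change."""
    m1, m2 = np.mean(usages_t1, axis=0), np.mean(usages_t2, axis=0)
    cos = np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2))
    return 1.0 - cos

rng = np.random.default_rng(1)
stable = semantic_shift(rng.normal(1.0, 0.1, (50, 8)),
                        rng.normal(1.0, 0.1, (50, 8)))   # same usage region
shifted = semantic_shift(rng.normal(1.0, 0.1, (50, 8)),
                         rng.normal(-1.0, 0.1, (50, 8))) # moved usage region
print(stable < shifted)  # True
```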
arXiv Detail & Related papers (2020-04-29T12:18:14Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore their latent variables (posterior collapse).
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
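A discrete latent bottleneck of the kind the last summary refers to is commonly realized by vector quantization: snapping a continuous latent to its nearest codebook entry. This is a generic sketch with a hypothetical codebook, not necessarily the paper's specific mechanism:

```python
import numpy as np

def quantize(z, codebook):
    """Map a continuous latent z to its nearest codebook entry --
    one standard way to impose a discrete latent bottleneck."""
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
idx, z_q = quantize(np.array([0.9, 1.2]), codebook)
print(idx)  # 1: nearest entry is [1.0, 1.0]
```

Because the decoder then conditions on a small set of discrete codes rather than an unconstrained continuous vector, it cannot simply route around the latent, which mitigates the posterior-collapse problem noted above.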