An Interpretability Illusion for BERT
- URL: http://arxiv.org/abs/2104.07143v1
- Date: Wed, 14 Apr 2021 22:04:48 GMT
- Title: An Interpretability Illusion for BERT
- Authors: Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif,
Fernanda Viégas, Martin Wattenberg
- Abstract summary: We describe an "interpretability illusion" that arises when analyzing the BERT model.
We trace the source of this illusion to geometric properties of BERT's embedding space.
We provide a taxonomy of model-learned concepts and discuss methodological implications for interpretability research.
- Score: 61.2687465308121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an "interpretability illusion" that arises when analyzing the
BERT model. Activations of individual neurons in the network may spuriously
appear to encode a single, simple concept, when in fact they are encoding
something far more complex. The same effect holds for linear combinations of
activations. We trace the source of this illusion to geometric properties of
BERT's embedding space as well as the fact that common text corpora represent
only narrow slices of possible English sentences. We provide a taxonomy of
model-learned concepts and discuss methodological implications for
interpretability research, especially the importance of testing hypotheses on
multiple data sets.
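The methodological point lends itself to a short illustration. The sketch below is not the authors' code; the model, layer, neuron index, and the two toy corpora are placeholder assumptions. It ranks sentences by a single BERT neuron's activation; the illusion arises when the top-activating examples from one corpus suggest a clean concept that does not reappear when the same neuron is inspected on a different corpus.

```python
# Hedged sketch: rank sentences by one BERT neuron's activation and compare
# the top examples across corpora. Layer, neuron index, and corpora are
# arbitrary illustrative choices, not values from the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

LAYER, NEURON = 8, 321  # hidden layer and hidden-dimension index (0..767)

def neuron_activation(sentence: str) -> float:
    """Mean activation of one hidden dimension over the sentence's tokens."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # [1, seq_len, 768]
    return hidden[0, :, NEURON].mean().item()

def top_activating(sentences, k=5):
    return sorted(sentences, key=neuron_activation, reverse=True)[:k]

# Inspecting only corpus_a might suggest the neuron encodes, say, "legal events";
# checking corpus_b (or any second data set) tests whether that label holds up.
corpus_a = ["He signed the contract yesterday.", "The treaty was ratified in 1990."]
corpus_b = ["The cat chased a red ball.", "Rain is expected over the weekend."]
print(top_activating(corpus_a))
print(top_activating(corpus_b))
```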
Related papers
- Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
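As a rough illustration of the sparse-autoencoder idea in the entry above, a generic sketch (not the paper's implementation; the dictionary size and sparsity coefficient are arbitrary assumptions) might look like this:

```python
# Generic sparse autoencoder for dense text embeddings: an overcomplete ReLU
# dictionary trained with reconstruction loss plus an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed=768, d_hidden=8192):
        super().__init__()
        self.encoder = nn.Linear(d_embed, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_embed)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        recon = self.decoder(codes)          # reconstruction of the input embedding
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # Reconstruction error plus a penalty that keeps few features active at once.
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()

# Usage sketch: `embeddings` stands in for a batch of dense sentence embeddings.
sae = SparseAutoencoder()
embeddings = torch.randn(32, 768)
recon, codes = sae(embeddings)
loss = sae_loss(embeddings, recon, codes)
loss.backward()
```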
- On the Joint Interaction of Models, Data, and Features [82.60073661644435]
We introduce a new tool, the interaction tensor, for empirically analyzing the interaction between data and model through features.
Based on these observations, we propose a conceptual framework for feature learning.
Under this framework, the expected accuracy for a single hypothesis and agreement for a pair of hypotheses can both be derived in closed-form.
arXiv Detail & Related papers (2023-06-07T21:35:26Z)
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z)
- Semantic interpretation for convolutional neural networks: What makes a cat a cat? [3.132595571344153]
We introduce the framework of semantic explainable AI (S-XAI).
S-XAI uses row-centered principal component analysis to obtain the common traits from the best combination of superpixels discovered by a genetic algorithm.
It extracts understandable semantic spaces on the basis of discovered semantically sensitive neurons and visualization techniques.
arXiv Detail & Related papers (2022-04-16T05:25:17Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- The Low-Dimensional Linear Geometry of Contextualized Word Representations [27.50785941238007]
We study the linear geometry of contextualized word representations in ELMo and BERT.
We show that a variety of linguistic features are encoded in low-dimensional subspaces.
arXiv Detail & Related papers (2021-05-15T00:58:08Z)
- Exploring the Role of BERT Token Representations to Explain Sentence Probing Results [15.652077779677091]
We show that BERT tends to encode meaningful knowledge in specific token representations.
This allows the model to detect syntactic and semantic abnormalities and to distinctively separate grammatical number and tense subspaces.
arXiv Detail & Related papers (2021-04-03T20:40:42Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
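One quick way to see the anisotropy the entry above describes is to measure the average cosine similarity between embeddings of unrelated sentences; values near 1 indicate the vectors occupy a narrow cone. The sketch below uses mean pooling over bert-base-uncased as an illustrative assumption and does not implement the paper's normalizing-flow remedy.

```python
# Hedged sketch: estimate anisotropy as the mean pairwise cosine similarity
# between sentence embeddings of unrelated sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled sentence vector

sentences = ["The stock market fell sharply.",
             "My cat sleeps all afternoon.",
             "Quantum computers use qubits."]
vecs = torch.stack([embed(s) for s in sentences])
vecs = torch.nn.functional.normalize(vecs, dim=-1)
sims = vecs @ vecs.T
off_diag = sims[~torch.eye(len(sentences), dtype=torch.bool)]
print(f"mean pairwise cosine similarity: {off_diag.mean().item():.3f}")
```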
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- INFOTABS: Inference on Tables as Semi-structured Data [39.84930221015755]
We introduce a new dataset called INFOTABS, comprising human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes.
Our analysis shows that the semi-structured, multi-domain and heterogeneous nature of the premises admits complex, multi-faceted reasoning.
Experiments reveal that, while human annotators agree on the relationships between a table-hypothesis pair, several standard modeling strategies are unsuccessful at the task.
arXiv Detail & Related papers (2020-05-13T02:07:54Z)