Sparse Interventions in Language Models with Differentiable Masking
- URL: http://arxiv.org/abs/2112.06837v1
- Date: Mon, 13 Dec 2021 17:49:16 GMT
- Title: Sparse Interventions in Language Models with Differentiable Masking
- Authors: Nicola De Cao, Leon Schmid, Dieuwke Hupkes, Ivan Titov
- Abstract summary: We propose a method that discovers within a neural LM a small subset of neurons responsible for a linguistic phenomenon.
Our experiments confirm that each of these phenomena is mediated through a small subset of neurons.
- Score: 37.220380160016624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a lot of interest in understanding what information is
captured by hidden representations of language models (LMs). Typically,
interpretation methods i) do not guarantee that the model actually uses the
encoded information, and ii) do not discover small subsets of neurons
responsible for a considered phenomenon. Inspired by causal mediation analysis,
we propose a method that discovers within a neural LM a small subset of neurons
responsible for a particular linguistic phenomenon, i.e., subsets causing a
change in the corresponding token emission probabilities. We use a
differentiable relaxation to approximately search through the combinatorial
space. An $L_0$ regularization term ensures that the search converges to
discrete and sparse solutions. We apply our method to analyze subject-verb
number agreement and gender bias detection in LSTMs. We observe that it is fast
and finds better solutions than the alternative (REINFORCE). Our experiments
confirm that each of these phenomena is mediated through a small subset of
neurons that do not play any other discernible role.
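As a rough illustration of the approach (a minimal sketch assuming the stretched Hard Concrete gates commonly used for $L_0$ relaxations; names, shapes, and hyperparameters below are illustrative, not the authors' released code), a differentiable mask is placed over hidden units and an expected-$L_0$ term counts how many units survive:
```python
# Illustrative sketch (not the authors' code): Hard Concrete gates over hidden
# units, with an expected-L0 penalty that drives the mask towards sparse,
# near-discrete solutions.
import torch
import torch.nn as nn

class HardConcreteMask(nn.Module):
    def __init__(self, n_units, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_units))   # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta  # stretch parameters

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clip to [0, 1] so exact 0 and 1 have mass.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # P(gate > 0): differentiable surrogate for the number of kept units.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

# Usage sketch: gate a hidden state before the LM's output layer and train only
# the gates so the verb-number probabilities change, plus the sparsity penalty.
mask = HardConcreteMask(n_units=650)          # e.g. an LSTM hidden size
hidden = torch.randn(8, 650)                  # stand-in for LM hidden states
masked_hidden = hidden * mask()               # intervention: zero out gated units
loss = masked_hidden.pow(2).mean() + 1e-3 * mask.expected_l0()  # placeholder task loss + L0
loss.backward()
```
At convergence the gates saturate at 0 or 1, so the handful of units with non-zero gates are the candidate mediators of the phenomenon.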
Related papers
- Emergent Symbol-like Number Variables in Artificial Neural Networks [34.388552536773034]
We aim to understand how well Neural Network (NN) solutions can be explained through simple, interpretable symbolic algorithms (SAs).
We use GRUs, LSTMs, and Transformers trained using Next Token Prediction (NTP) on numeric tasks.
We show through multiple causal and theoretical methods that the NNs' raw activity can be interpreted through the lens of these simplified SAs.
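For concreteness, one numeric task of this kind can be framed as next-token prediction roughly as follows (a toy construction assumed for illustration, not the paper's benchmark):
```python
# Toy numeric task posed as next-token prediction (an assumed example, not the
# paper's benchmark): after seeing N item tokens, the model must emit N responses.
def make_counting_example(n_items, item="item", response="R", eos="EOS"):
    tokens = [item] * n_items + [response] * n_items + [eos]
    return tokens[:-1], tokens[1:]   # NTP pairs: input token -> next-token target

inputs, targets = make_counting_example(3)
print(inputs)   # ['item', 'item', 'item', 'R', 'R', 'R']
print(targets)  # ['item', 'item', 'R', 'R', 'R', 'EOS']
```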
arXiv Detail & Related papers (2025-01-10T18:03:46Z)
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [14.594698598522797]
Demonstrating feature universality allows discoveries about latent representations to generalize across several models.
We employ a method known as dictionary learning to transform LLM activations into interpretable spaces spanned by neurons corresponding to individual features.
Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
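A minimal sketch of such a dictionary-learning setup, assuming a standard sparse autoencoder over residual-stream activations (sizes and penalty weights below are made up):
```python
# Minimal sparse-autoencoder sketch for LLM activations (assumed sizes/penalties).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete feature basis
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = F.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(1024, 768)                          # stand-in for LM activations
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
```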
arXiv Detail & Related papers (2024-10-09T15:18:57Z)
- Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs [70.3132264719438]
We aim to fill the research gap by examining how neuron activation is shared across tasks and languages.
We classify neurons into four distinct categories based on their responses to a specific input across different languages.
Our analysis reveals the following insights: (i) the patterns of neuron sharing are significantly affected by the characteristics of tasks and examples; (ii) neuron sharing does not fully correspond with language similarity; (iii) shared neurons play a vital role in generating responses, especially those shared across all languages.
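A toy version of this categorisation might look as follows (the threshold and category labels are assumptions for illustration, not the paper's exact protocol):
```python
# Toy neuron-sharing categorisation across languages (assumed threshold and labels).
import numpy as np

def categorise_neurons(activations_by_language, threshold=0.1):
    """activations_by_language: dict lang -> array of shape (n_neurons,) holding
    mean absolute activations for one input rendered in each language."""
    langs = list(activations_by_language)
    active = np.stack([np.abs(activations_by_language[l]) > threshold for l in langs])
    n_active = active.sum(axis=0)                 # in how many languages each neuron fires
    categories = np.full(active.shape[1], "non-activated", dtype=object)
    categories[n_active == len(langs)] = "all-shared"
    categories[(n_active > 1) & (n_active < len(langs))] = "partial-shared"
    categories[n_active == 1] = "specific"
    return categories

acts = {lang: np.random.rand(16) for lang in ["en", "de", "zh"]}
print(categorise_neurons(acts))
```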
arXiv Detail & Related papers (2024-06-13T16:04:11Z)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS (a scaled-up Distributed Alignment Search) to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- New Interpretable Patterns and Discriminative Features from Brain Functional Network Connectivity Using Dictionary Learning [21.676573007839544]
ICA can identify patterns that discriminate between healthy controls (HC) and patients with various mental disorders such as schizophrenia (Sz).
Dictionary learning (DL) enables the discovery of hidden information in data using learnable basis signals through the use of sparsity.
We present a new method that leverages ICA and DL for the identification of directly interpretable patterns to discriminate between the HC and Sz groups.
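A rough outline of combining the two decompositions, with synthetic data and scikit-learn defaults standing in for the paper's actual pipeline:
```python
# Sketch: ICA to extract functional network connectivity components, then
# dictionary learning on those components for sparse, interpretable patterns.
import numpy as np
from sklearn.decomposition import FastICA, DictionaryLearning

rng = np.random.default_rng(0)
fnc = rng.standard_normal((200, 1378))        # stand-in: subjects x FNC features

ica = FastICA(n_components=20, random_state=0)
sources = ica.fit_transform(fnc)              # subject-level independent components

dl = DictionaryLearning(n_components=10, alpha=1.0, random_state=0)
codes = dl.fit_transform(sources)             # sparse codes over learned basis signals

# The sparse codes could then be compared between HC and Sz groups, e.g. with t-tests.
print(codes.shape)
```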
arXiv Detail & Related papers (2022-11-10T19:49:16Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Demystifying Neural Language Models' Insensitivity to Word-Order [7.72780997900827]
We investigate the insensitivity of neural language models to word order by quantifying the effect of word-order perturbations.
We find that neural language models require local ordering more so than the global ordering of tokens.
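One way to picture the local-vs-global distinction is to shuffle tokens only within small windows (disturbing local order) or across the whole sentence (destroying global order) and compare a model's perplexity on each; the sketch below only builds the perturbations and assumes whatever scoring model is at hand:
```python
# Toy word-order perturbations: local (within a small window) vs. global shuffles.
import random

def local_shuffle(tokens, window=3, seed=0):
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)          # reorder only within the window
        out.extend(chunk)
    return out

def global_shuffle(tokens, seed=0):
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)                # reorder the entire sentence
    return out

sentence = "the keys to the cabinet are on the table".split()
print(local_shuffle(sentence))      # local order disturbed, global structure mostly intact
print(global_shuffle(sentence))     # global order destroyed
# A language model's perplexity on each variant then quantifies its sensitivity.
```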
arXiv Detail & Related papers (2021-07-29T13:34:20Z)
- What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
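The probing framework here amounts to training a simple classifier per layer on frozen representations; the sketch below uses random stand-in data and a linear probe as assumptions:
```python
# Layer-wise probing sketch: fit a linear probe on frozen utterance representations
# for an auxiliary property (e.g. gender or channel), one probe per layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utterances, n_layers, dim = 500, 6, 256
reps = rng.standard_normal((n_layers, n_utterances, dim))   # stand-in representations
labels = rng.integers(0, 2, size=n_utterances)              # stand-in property labels

for layer in range(n_layers):
    x_tr, x_te, y_tr, y_te = train_test_split(reps[layer], labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    print(f"layer {layer}: probe accuracy = {probe.score(x_te, y_te):.2f}")
```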
arXiv Detail & Related papers (2021-07-01T13:32:55Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
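In this operationalization, I(R; Y) = H(Y) - H(Y|R) is estimated by comparing the cross-entropy of a probe against that of the label marginal; the sketch below uses stand-in data and a logistic-regression probe as assumptions:
```python
# Sketch: estimate I(R; Y) = H(Y) - H(Y | R) with cross-entropy surrogates,
# using a linear probe for H(Y | R) and the label marginal for H(Y).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
reps = rng.standard_normal((1000, 64))        # stand-in contextual representations R
labels = rng.integers(0, 5, size=1000)        # stand-in linguistic property Y (e.g. POS)

probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
h_y_given_r = log_loss(labels[800:], probe.predict_proba(reps[800:]), labels=list(range(5)))

marginal = np.bincount(labels[:800], minlength=5) / 800
h_y = log_loss(labels[800:], np.tile(marginal, (200, 1)), labels=list(range(5)))

print(f"estimated I(R; Y) = {h_y - h_y_given_r:.3f} nats")
```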
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.