Sparse Interventions in Language Models with Differentiable Masking
- URL: http://arxiv.org/abs/2112.06837v1
- Date: Mon, 13 Dec 2021 17:49:16 GMT
- Title: Sparse Interventions in Language Models with Differentiable Masking
- Authors: Nicola De Cao, Leon Schmid, Dieuwke Hupkes, Ivan Titov
- Abstract summary: We propose a method that discovers within a neural LM a small subset of neurons responsible for a linguistic phenomenon.
Our experiments confirm that each of these phenomena is mediated through a small subset of neurons.
- Score: 37.220380160016624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a lot of interest in understanding what information is
captured by hidden representations of language models (LMs). Typically,
interpretation methods i) do not guarantee that the model actually uses the
encoded information, and ii) do not discover small subsets of neurons
responsible for a considered phenomenon. Inspired by causal mediation analysis,
we propose a method that discovers within a neural LM a small subset of neurons
responsible for a particular linguistic phenomenon, i.e., subsets causing a
change in the corresponding token emission probabilities. We use a
differentiable relaxation to approximately search through the combinatorial
space. An $L_0$ regularization term ensures that the search converges to
discrete and sparse solutions. We apply our method to analyze subject-verb
number agreement and gender bias detection in LSTMs. We observe that it is fast
and finds better solutions than the alternative (REINFORCE). Our experiments
confirm that each of these phenomena is mediated through a small subset of
neurons that do not play any other discernible role.
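As a rough illustration of the approach (a minimal sketch assuming the stretched Hard Concrete gates commonly used for $L_0$ relaxations; names, shapes, and hyperparameters below are illustrative, not the authors' released code), a differentiable mask is placed over hidden units and an expected-$L_0$ term counts how many units survive:
```python
# Illustrative sketch (not the authors' code): Hard Concrete gates over hidden
# units, with an expected-L0 penalty that drives the mask towards sparse,
# near-discrete solutions.
import torch
import torch.nn as nn

class HardConcreteMask(nn.Module):
    def __init__(self, n_units, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_units))   # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta  # stretch parameters

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clip to [0, 1] so exact 0 and 1 have mass.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # P(gate > 0): differentiable surrogate for the number of kept units.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

# Usage sketch: gate a hidden state before the LM's output layer and train only
# the gates so the verb-number probabilities change, plus the sparsity penalty.
mask = HardConcreteMask(n_units=650)          # e.g. an LSTM hidden size
hidden = torch.randn(8, 650)                  # stand-in for LM hidden states
masked_hidden = hidden * mask()               # intervention: zero out gated units
loss = masked_hidden.pow(2).mean() + 1e-3 * mask.expected_l0()  # placeholder task loss + L0
loss.backward()
```
At convergence the gates saturate at 0 or 1, so the handful of units with non-zero gates are the candidate mediators of the phenomenon.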
Related papers
- Emergent Symbol-like Number Variables in Artificial Neural Networks [34.388552536773034]
We aim to understand how well Neural Network (NN) solutions can be explained through simple, interpretable symbolic algorithms (SAs).
We use GRUs, LSTMs, and Transformers trained using Next Token Prediction (NTP) on numeric tasks.
We show through multiple causal and theoretical methods that the NNs' raw activity can be interpreted through the lens of these simplified SAs.
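For concreteness, one numeric task of this kind can be framed as next-token prediction roughly as follows (a toy construction assumed for illustration, not the paper's benchmark):
```python
# Toy numeric task posed as next-token prediction (an assumed example, not the
# paper's benchmark): after seeing N item tokens, the model must emit N responses.
def make_counting_example(n_items, item="item", response="R", eos="EOS"):
    tokens = [item] * n_items + [response] * n_items + [eos]
    return tokens[:-1], tokens[1:]   # NTP pairs: input token -> next-token target

inputs, targets = make_counting_example(3)
print(inputs)   # ['item', 'item', 'item', 'R', 'R', 'R']
print(targets)  # ['item', 'item', 'R', 'R', 'R', 'EOS']
```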
arXiv Detail & Related papers (2025-01-10T18:03:46Z)
- Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models [14.594698598522797]
Demonstrating feature universality allows discoveries about latent representations to generalize across several models.
We employ a method known as dictionary learning to transform LLM activations into interpretable spaces spanned by neurons corresponding to individual features.
Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
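A minimal sketch of such a dictionary-learning setup, assuming a standard sparse autoencoder over residual-stream activations (sizes and penalty weights below are made up):
```python
# Minimal sparse-autoencoder sketch for LLM activations (assumed sizes/penalties).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # overcomplete feature basis
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = F.relu(self.encoder(activations))   # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
acts = torch.randn(1024, 768)                          # stand-in for LM activations
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
```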
arXiv Detail & Related papers (2024-10-09T15:18:57Z)
- Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs [70.3132264719438]
We aim to fill the research gap by examining how neuron activation is shared across tasks and languages.
We classify neurons into four distinct categories based on their responses to a specific input across different languages.
Our analysis reveals the following insights: (i) the patterns of neuron sharing are significantly affected by the characteristics of tasks and examples; (ii) neuron sharing does not fully correspond with language similarity; (iii) shared neurons play a vital role in generating responses, especially those shared across all languages.
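A toy version of this categorisation might look as follows (the threshold and category labels are assumptions for illustration, not the paper's exact protocol):
```python
# Toy neuron-sharing categorisation across languages (assumed threshold and labels).
import numpy as np

def categorise_neurons(activations_by_language, threshold=0.1):
    """activations_by_language: dict lang -> array of shape (n_neurons,) holding
    mean absolute activations for one input rendered in each language."""
    langs = list(activations_by_language)
    active = np.stack([np.abs(activations_by_language[l]) > threshold for l in langs])
    n_active = active.sum(axis=0)                 # in how many languages each neuron fires
    categories = np.full(active.shape[1], "non-activated", dtype=object)
    categories[n_active == len(langs)] = "all-shared"
    categories[(n_active > 1) & (n_active < len(langs))] = "partial-shared"
    categories[n_active == 1] = "specific"
    return categories

acts = {lang: np.random.rand(16) for lang in ["en", "de", "zh"]}
print(categorise_neurons(acts))
```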
arXiv Detail & Related papers (2024-06-13T16:04:11Z)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS (a scaled-up Distributed Alignment Search) to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- New Interpretable Patterns and Discriminative Features from Brain Functional Network Connectivity Using Dictionary Learning [21.676573007839544]
ICA can identify patterns that discriminate between healthy controls (HC) and patients with various mental disorders such as schizophrenia (Sz).
Dictionary learning (DL) enables the discovery of hidden information in data using learnable basis signals through the use of sparsity.
We present a new method that leverages ICA and DL for the identification of directly interpretable patterns to discriminate between the HC and Sz groups.
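A rough outline of combining the two decompositions, with synthetic data and scikit-learn defaults standing in for the paper's actual pipeline:
```python
# Sketch: ICA to extract functional network connectivity components, then
# dictionary learning on those components for sparse, interpretable patterns.
import numpy as np
from sklearn.decomposition import FastICA, DictionaryLearning

rng = np.random.default_rng(0)
fnc = rng.standard_normal((200, 1378))        # stand-in: subjects x FNC features

ica = FastICA(n_components=20, random_state=0)
sources = ica.fit_transform(fnc)              # subject-level independent components

dl = DictionaryLearning(n_components=10, alpha=1.0, random_state=0)
codes = dl.fit_transform(sources)             # sparse codes over learned basis signals

# The sparse codes could then be compared between HC and Sz groups, e.g. with t-tests.
print(codes.shape)
```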
arXiv Detail & Related papers (2022-11-10T19:49:16Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Demystifying Neural Language Models' Insensitivity to Word-Order [7.72780997900827]
We investigate the insensitivity of neural language models to word order by quantifying the effect of word-order perturbations.
We find that neural language models require local ordering more so than the global ordering of tokens.
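One way to picture the local-vs-global distinction is to shuffle tokens only within small windows (disturbing local order) or across the whole sentence (destroying global order) and compare a model's perplexity on each; the sketch below only builds the perturbations and assumes whatever scoring model is at hand:
```python
# Toy word-order perturbations: local (within a small window) vs. global shuffles.
import random

def local_shuffle(tokens, window=3, seed=0):
    rng = random.Random(seed)
    out = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        rng.shuffle(chunk)          # reorder only within the window
        out.extend(chunk)
    return out

def global_shuffle(tokens, seed=0):
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)                # reorder the entire sentence
    return out

sentence = "the keys to the cabinet are on the table".split()
print(local_shuffle(sentence))      # local order disturbed, global structure mostly intact
print(global_shuffle(sentence))     # global order destroyed
# A language model's perplexity on each variant then quantifies its sensitivity.
```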
arXiv Detail & Related papers (2021-07-29T13:34:20Z)
- What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
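The probing framework here amounts to training a simple classifier per layer on frozen representations; the sketch below uses random stand-in data and a linear probe as assumptions:
```python
# Layer-wise probing sketch: fit a linear probe on frozen utterance representations
# for an auxiliary property (e.g. gender or channel), one probe per layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utterances, n_layers, dim = 500, 6, 256
reps = rng.standard_normal((n_layers, n_utterances, dim))   # stand-in representations
labels = rng.integers(0, 2, size=n_utterances)              # stand-in property labels

for layer in range(n_layers):
    x_tr, x_te, y_tr, y_te = train_test_split(reps[layer], labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    print(f"layer {layer}: probe accuracy = {probe.score(x_te, y_te):.2f}")
```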
arXiv Detail & Related papers (2021-07-01T13:32:55Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
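In this operationalization, I(R; Y) = H(Y) - H(Y|R) is estimated by comparing the cross-entropy of a probe against that of the label marginal; the sketch below uses stand-in data and a logistic-regression probe as assumptions:
```python
# Sketch: estimate I(R; Y) = H(Y) - H(Y | R) with cross-entropy surrogates,
# using a linear probe for H(Y | R) and the label marginal for H(Y).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
reps = rng.standard_normal((1000, 64))        # stand-in contextual representations R
labels = rng.integers(0, 5, size=1000)        # stand-in linguistic property Y (e.g. POS)

probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
h_y_given_r = log_loss(labels[800:], probe.predict_proba(reps[800:]), labels=list(range(5)))

marginal = np.bincount(labels[:800], minlength=5) / 800
h_y = log_loss(labels[800:], np.tile(marginal, (200, 1)), labels=list(range(5)))

print(f"estimated I(R; Y) = {h_y - h_y_given_r:.3f} nats")
```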
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.