RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- URL: http://arxiv.org/abs/2402.17700v2
- Date: Mon, 26 Aug 2024 19:26:06 GMT
- Title: RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- Authors: Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
- Abstract summary: We introduce RAVEL, a dataset that enables tightly controlled, quantitative comparisons between interpretability methods.
We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS).
With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL.
- Score: 38.79058788596755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute-Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/explanare/ravel.
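The abstract describes MDAS as searching for distributed features (directions in activation space) that satisfy multiple causal criteria at once: intervening on the learned subspace should change the targeted attribute while leaving other attributes intact. Below is a minimal, hypothetical sketch of the core building block, a distributed interchange intervention over a learned orthogonal subspace. The class name, dimensions, and training note are illustrative assumptions, not the authors' released implementation (see https://github.com/explanare/ravel for that).

```python
# A hedged sketch of a distributed interchange intervention in the spirit of
# (M)DAS: a learned orthogonal rotation picks out a low-dimensional subspace of
# a hidden representation, and that subspace is swapped between a "base" run
# and a "source" run of the model. Names and sizes are illustrative only.
import torch
import torch.nn as nn


class DistributedInterchange(nn.Module):
    def __init__(self, hidden_size: int, subspace_size: int):
        super().__init__()
        # Orthogonally parametrized square matrix; its first `subspace_size`
        # rotated coordinates define the learned feature subspace.
        self.rotation = nn.utils.parametrizations.orthogonal(
            nn.Linear(hidden_size, hidden_size, bias=False)
        )
        self.k = subspace_size

    def forward(self, base_h: torch.Tensor, source_h: torch.Tensor) -> torch.Tensor:
        # Rotate both representations, swap the first k coordinates
        # (the targeted attribute subspace), then rotate back.
        R = self.rotation.weight
        base_r, source_r = base_h @ R.T, source_h @ R.T
        mixed = torch.cat([source_r[..., : self.k], base_r[..., self.k :]], dim=-1)
        return mixed @ R


# Training (not shown) would optimize the rotation so that the intervened model
# changes its prediction for the targeted attribute (e.g., a city's country)
# while leaving other attributes of the same entity (e.g., its language) unchanged.
```

In the multi-task setting the abstract suggests, one such intervention module would presumably be optimized jointly against several attribute-specific objectives, so that a single learned subspace isolates one attribute across all of them.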
Related papers
- Multiple Choice Learning of Low Rank Adapters for Language Modeling [40.380297530862656]
We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time.
We demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
arXiv Detail & Related papers (2025-07-14T16:00:51Z) - The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs [54.59207567677249]
Large language models (LLMs) still struggle across tasks outside of high-resource languages.
In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce.
arXiv Detail & Related papers (2025-05-23T20:28:31Z) - The Multi-Faceted Monosemanticity in Multimodal Representations [42.64636740703632]
We leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models.
Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities.
These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.
arXiv Detail & Related papers (2025-02-16T14:51:07Z) - The Complexity of Learning Sparse Superposed Features with Feedback [0.9838799448847586]
We investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent.
We analyze the feedback complexity associated with learning a feature matrix in sparse settings.
Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios.
arXiv Detail & Related papers (2025-02-08T01:54:23Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from a massive pool, addressing an oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z) - MINERS: Multilingual Language Models as Semantic Retrievers [23.686762008696547]
This paper introduces the MINERS, a benchmark designed to evaluate the ability of multilingual language models in semantic retrieval tasks.
We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages.
Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches.
arXiv Detail & Related papers (2024-06-11T16:26:18Z) - Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve [78.3500985535601]
We find a surprising connection between multitask learning and robustness to neuron failures.
Our experiments show that bilingual language models retain higher performance under various neuron perturbations.
We provide a theoretical justification for this robustness by mathematically analyzing linear representation learning.
arXiv Detail & Related papers (2022-10-20T22:23:27Z) - Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR).
Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously.
Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z) - Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with only a small performance drop and almost the same number of parameters.
arXiv Detail & Related papers (2021-10-06T18:28:07Z) - Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural network based abstractive multi-document summarization (MDS) model.
We process the dependency information into the linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z) - An Investigation of Language Model Interpretability via Sentence Editing [5.492504126672887]
We re-purpose a sentence editing dataset as a testbed for interpretability of pre-trained language models (PLMs).
This enables us to conduct a systematic investigation on an array of questions regarding PLMs' interpretability.
The investigation generates new insights; for example, contrary to common understanding, we find that attention weights correlate well with human rationales.
arXiv Detail & Related papers (2020-11-28T00:46:43Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)