Related papers: Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

URL: http://arxiv.org/abs/2506.15538v2
Date: Fri, 20 Jun 2025 13:17:52 GMT
Title: Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Authors: Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle
Abstract summary: We introduce PRISM, a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features.
Score: 7.729065709338261
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Current feature description methods face two critical challenges: limited robustness and the flawed assumption that each neuron encodes only a single concept (monosemanticity), despite growing evidence that neurons are often polysemantic. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features. We apply PRISM to language models and, through extensive benchmarking against existing methods, demonstrate that our approach produces more accurate and faithful feature descriptions, improving both overall description quality (via a description score) and the ability to capture distinct concepts when polysemanticity is present (via a polysemanticity score).

Related papers

Probing the Vulnerability of Large Language Models to Polysemantic Interventions [49.64902130083662]
We investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small). Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models.
arXiv Detail & Related papers (2025-05-16T18:20:42Z)
Mechanistic understanding and validation of large AI models with SemanticLens [13.712668314238082]
Unlike human-engineered systems such as aeroplanes, the inner workings of AI models remain largely opaque. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components.
arXiv Detail & Related papers (2025-01-09T17:47:34Z)
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability. We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
MulCPred: Learning Multi-modal Concepts for Explainable Pedestrian Action Prediction [57.483718822429346]
MulCPred is proposed, which explains its predictions based on multi-modal concepts represented by training samples. MulCPred is evaluated on multiple datasets and tasks.
arXiv Detail & Related papers (2024-09-14T14:15:28Z)
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits [12.17671779091913]
We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet.
arXiv Detail & Related papers (2024-04-09T16:54:19Z)
Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability. It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs. The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z)
Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation. We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. We use sparse autoencoders to reconstruct the internal activations of a language model. Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z)
A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data [77.34726150561087]
This paper presents a new symbolic-only method for the generation of hierarchical concept structures from complex sensory data. The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept. The model is able to produce fairly rich yet human-readable conceptual representations without training.
arXiv Detail & Related papers (2023-07-16T15:59:13Z)
COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP tasks [3.475906200620518]
COCKATIEL is a novel, post-hoc, concept-based, model-agnostic XAI technique. It generates meaningful explanations from the last layer of a neural net model trained on an NLP classification task. It does so without compromising the accuracy of the underlying model or requiring a new one to be trained.
arXiv Detail & Related papers (2023-05-11T12:22:20Z)
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts. In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images. We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.