Probing the Vulnerability of Large Language Models to Polysemantic Interventions
- URL: http://arxiv.org/abs/2505.11611v1
- Date: Fri, 16 May 2025 18:20:42 GMT
- Title: Probing the Vulnerability of Large Language Models to Polysemantic Interventions
- Authors: Bofan Gong, Shiyang Lai, Dawn Song
- Abstract summary: We investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small). Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models.
- Score: 49.64902130083662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Polysemanticity -- where individual neurons encode multiple unrelated features -- is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. Its implications for model safety, however, remain poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings not only suggest that the interventions generalize but also point to a stable and transferable polysemantic structure that could potentially persist across architectures and training regimes.
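As a rough illustration of the sparse-autoencoder approach the abstract refers to (this is not the authors' code), the sketch below trains a toy SAE on a matrix of hidden activations and scores each neuron by how many dictionary features write to it, a crude proxy for polysemanticity. The dimensions, the 0.05 threshold, and the use of random data in place of Pythia-70M / GPT-2-Small activations are all assumptions made so the example runs standalone.

```python
# Minimal sketch, assuming activations are a [n_samples, d_model] tensor.
# In the paper's setting these would be MLP/residual activations from
# Pythia-70M or GPT-2-Small; here random data keeps the example self-contained.
import torch
import torch.nn as nn

d_model, d_dict, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(10_000, d_model)       # placeholder for real activations
for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, feats = sae(batch)
    # Reconstruction loss plus an L1 penalty that encourages sparse features.
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# A neuron is "polysemantic" to the extent that many dictionary features
# write to it: count features whose decoder weight on that neuron is
# non-negligible (threshold chosen arbitrarily for illustration).
with torch.no_grad():
    W_dec = sae.decoder.weight                      # shape [d_model, d_dict]
    features_per_neuron = (W_dec.abs() > 0.05).sum(dim=1)
print(features_per_neuron.float().topk(5))          # most feature-entangled neurons
```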
Related papers
- Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework [7.729065709338261]
We introduce PRISM, a novel framework that captures the inherent complexity of neural network features. Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features.
arXiv Detail & Related papers (2025-06-18T15:13:07Z)
- Towards Interpretable Protein Structure Prediction with Sparse Autoencoders [0.0]
Matryoshka SAEs learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently. We scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time. We show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction.
arXiv Detail & Related papers (2025-03-11T17:57:29Z)
- MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language [0.4631438140637248]
MAMMAL is a versatile method applied to create a multi-task foundation model that learns from large-scale biological datasets across diverse modalities. Evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) in nine tasks and is comparable to SOTA in two. We also explored Alphafold 3 binding prediction capabilities on antibody-antigen and nanobody-antigen complexes, where MAMMAL shows significantly better classification performance on 3 out of 4 targets.
arXiv Detail & Related papers (2024-10-28T20:45:52Z)
- Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z)
- RobustMQ: Benchmarking Robustness of Quantized Models [54.15661421492865]
Quantization is an essential technique for deploying deep neural networks (DNNs) on devices with limited resources.
We thoroughly evaluated the robustness of quantized models against various noises (adversarial attacks, natural corruptions, and systematic noises) on ImageNet.
Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.
arXiv Detail & Related papers (2023-08-04T14:37:12Z)
- S3M: Scalable Statistical Shape Modeling through Unsupervised Correspondences [91.48841778012782]
We propose an unsupervised method to simultaneously learn local and global shape structures across population anatomies.
Our pipeline significantly improves unsupervised correspondence estimation for SSMs compared to baseline methods.
Our method is robust enough to learn from noisy neural network predictions, potentially enabling scaling SSMs to larger patient populations.
arXiv Detail & Related papers (2023-04-15T09:39:52Z)
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small [68.879023473838]
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model.
arXiv Detail & Related papers (2022-11-01T17:08:44Z)
- Polysemanticity and Capacity in Neural Networks [2.9260206957981167]
Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult.
arXiv Detail & Related papers (2022-10-04T20:28:43Z)
- The Causal Neural Connection: Expressiveness, Learnability, and Inference [125.57815987218756]
An object called a structural causal model (SCM) represents a collection of mechanisms and sources of random variation of the system under investigation.
In this paper, we show that the causal hierarchy theorem (Thm. 1, Bareinboim et al., 2020) still holds for neural models.
We introduce a special type of SCM called a neural causal model (NCM), and formalize a new type of inductive bias to encode structural constraints necessary for performing causal inferences.
arXiv Detail & Related papers (2021-07-02T01:55:18Z)
- Semi-Structured Distributional Regression -- Extending Structured Additive Models by Arbitrary Deep Neural Networks and Data Modalities [0.0]
We propose a general framework to combine structured regression models and deep neural networks into a unifying network architecture.
We demonstrate the framework's efficacy in numerical experiments and illustrate its special merits in benchmarks and real-world applications.
arXiv Detail & Related papers (2020-02-13T21:01:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.