Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
- URL: http://arxiv.org/abs/2505.11611v2
- Date: Mon, 29 Sep 2025 16:04:15 GMT
- Title: Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
- Authors: Bofan Gong, Shiyang Lai, James Evans, Dawn Song,
- Abstract summary: Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control.<n>We map the polysemantic topology of two small models to identify feature pairs that are semantically unrelated yet exhibit interference within models.<n>We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models.
- Score: 46.548276232795466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four loci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.
Related papers
- D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models [91.21455683212224]
In large language models (LLMs), the probability of relevance for the next piece of information is linked to the probability of relevance for the next product.<n>But whether fine-grained sampling probabilities faithfully align with task requirements remains an open question.<n>We identify two model types: D-models, whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models, whose P_token is more stable and better aligned with P_task.
arXiv Detail & Related papers (2026-01-25T14:59:09Z) - Explaining Machine Learning Predictive Models through Conditional Expectation Methods [0.0]
MUCE is a model-agnostic method for local explainability designed to capture prediction changes from feature interactions.<n>Two quantitative indices, stability and uncertainty, summarize local behavior and assess model reliability.<n>Results show that MUCE effectively captures complex local model behavior, while the stability and uncertainty indices provide meaningful insight into prediction confidence.
arXiv Detail & Related papers (2026-01-12T08:34:36Z) - Understanding Overparametrization in Survival Models through Interpolation [14.444096460952961]
Recent advances in machine learning have revealed a more complex pattern, textitdouble-descent, in which test loss, after peaking near the threshold, decreases again as model capacity continues to grow.<n>This study investigates overparametrization in four representative survival models: DeepSurv, PC-Hazard, Nnet-Survival, and N-MTLR.
arXiv Detail & Related papers (2025-12-13T21:23:02Z) - Concept-SAE: Active Causal Probing of Visual Model Behavior [10.346577706023139]
Concept-SAE is a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy.<n>We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized.<n>This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers.
arXiv Detail & Related papers (2025-09-26T07:51:03Z) - From Patterns to Predictions: A Shapelet-Based Framework for Directional Forecasting in Noisy Financial Markets [8.168261768703621]
Directional forecasting in financial markets requires both accuracy and interpretability.<n>We propose a two-stage framework that integrates unsupervised pattern extracion with interpretable forecasting.<n>Our approach enables transparent decision-making by revealing the underlying pattern structures that drive predictive outcomes.
arXiv Detail & Related papers (2025-09-18T15:05:27Z) - Persona Features Control Emergent Misalignment [9.67070289452428]
We show that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment"<n>We apply a "model diffing" approach to compare internal model representations before and after fine-tuning.<n>We also investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
arXiv Detail & Related papers (2025-06-24T17:38:21Z) - Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework [7.729065709338261]
We introduce PRISM, a novel framework that captures the inherent complexity of neural network features.<n>Unlike prior approaches that assign a single description per feature, PRISM provides more nuanced descriptions for both polysemantic and monosemantic features.
arXiv Detail & Related papers (2025-06-18T15:13:07Z) - Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.<n>We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z) - Towards Interpretable Protein Structure Prediction with Sparse Autoencoders [0.0]
Matryoshka SAEs learn hierarchically organized features by forcing nested groups of latents to reconstruct inputs independently.<n>We scale SAEs to ESM2-3B, the base model for ESMFold, enabling mechanistic interpretability of protein structure prediction for the first time.<n>We show that SAEs trained on ESM2-3B significantly outperform those trained on smaller models for both biological concept discovery and contact map prediction.
arXiv Detail & Related papers (2025-03-11T17:57:29Z) - MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language [0.4631438140637248]
MAMMAL is a versatile method applied to create a multi-task foundation model that learns from large-scale biological datasets across diverse modalities.<n> evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) in nine tasks and is comparable to SOTA in two tasks.<n> explored Alphafold 3 binding prediction capabilities on antibody-antigen and nanobody-antigen complexes showing significantly better classification performance of MAMMAL in 3 out of 4 targets.
arXiv Detail & Related papers (2024-10-28T20:45:52Z) - Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity.
Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability.
We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z) - Sparse Autoencoders Find Highly Interpretable Features in Language
Models [0.0]
Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally.
We use sparse autoencoders to reconstruct the internal activations of a language model.
Our method may serve as a foundation for future mechanistic interpretability work.
arXiv Detail & Related papers (2023-09-15T17:56:55Z) - RobustMQ: Benchmarking Robustness of Quantized Models [54.15661421492865]
Quantization is an essential technique for deploying deep neural networks (DNNs) on devices with limited resources.
We thoroughly evaluated the robustness of quantized models against various noises (adrial attacks, natural corruptions, and systematic noises) on ImageNet.
Our research contributes to advancing the robust quantization of models and their deployment in real-world scenarios.
arXiv Detail & Related papers (2023-08-04T14:37:12Z) - S3M: Scalable Statistical Shape Modeling through Unsupervised
Correspondences [91.48841778012782]
We propose an unsupervised method to simultaneously learn local and global shape structures across population anatomies.
Our pipeline significantly improves unsupervised correspondence estimation for SSMs compared to baseline methods.
Our method is robust enough to learn from noisy neural network predictions, potentially enabling scaling SSMs to larger patient populations.
arXiv Detail & Related papers (2023-04-15T09:39:52Z) - Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small [68.879023473838]
We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI)
To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model.
arXiv Detail & Related papers (2022-11-01T17:08:44Z) - Polysemanticity and Capacity in Neural Networks [2.9260206957981167]
Individual neurons in neural networks often represent a mixture of unrelated features.<n>This phenomenon, called polysemanticity, can make interpreting neural networks more difficult.
arXiv Detail & Related papers (2022-10-04T20:28:43Z) - The Causal Neural Connection: Expressiveness, Learnability, and
Inference [125.57815987218756]
An object called structural causal model (SCM) represents a collection of mechanisms and sources of random variation of the system under investigation.
In this paper, we show that the causal hierarchy theorem (Thm. 1, Bareinboim et al., 2020) still holds for neural models.
We introduce a special type of SCM called a neural causal model (NCM), and formalize a new type of inductive bias to encode structural constraints necessary for performing causal inferences.
arXiv Detail & Related papers (2021-07-02T01:55:18Z) - Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z) - Structural Causal Models Are (Solvable by) Credal Networks [70.45873402967297]
Causal inferences can be obtained by standard algorithms for the updating of credal nets.
This contribution should be regarded as a systematic approach to represent structural causal models by credal networks.
Experiments show that approximate algorithms for credal networks can immediately be used to do causal inference in real-size problems.
arXiv Detail & Related papers (2020-08-02T11:19:36Z) - Semi-Structured Distributional Regression -- Extending Structured
Additive Models by Arbitrary Deep Neural Networks and Data Modalities [0.0]
We propose a general framework to combine structured regression models and deep neural networks into a unifying network architecture.
We demonstrate the framework's efficacy in numerical experiments and illustrate its special merits in benchmarks and real-world applications.
arXiv Detail & Related papers (2020-02-13T21:01:26Z) - Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial
Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.