Related papers: Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

URL: http://arxiv.org/abs/2406.17969v2
Date: Tue, 15 Oct 2024 22:37:55 GMT
Title: Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
Authors: Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He
Abstract summary: A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity.
Score: 30.290777756014748
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on the monosemanticity of their basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by Wang et al. (2024), which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.
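As a rough illustration of what a feature decorrelation regularizer can look like, the sketch below computes the mean squared off-diagonal Pearson correlation between feature dimensions over a batch. It is a generic penalty written for this page, not the authors' exact formulation; the function name `decorrelation_penalty` and the weighting term `lam` mentioned afterwards are hypothetical.

```python
import math


def decorrelation_penalty(features):
    """Mean squared off-diagonal Pearson correlation between feature columns.

    `features` is a list of rows (one per sample), each a list of floats with
    at least two feature dimensions. This is an illustrative penalty only,
    not the regularizer defined in the paper.
    """
    n = len(features)
    d = len(features[0])
    # Center each feature (column) across the batch.
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    # Per-feature norms; the small epsilon guards against constant features.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in centered)) + 1e-8
        for j in range(d)
    ]
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue  # only cross-feature correlations are penalized
            cov = sum(row[i] * row[j] for row in centered)
            total += (cov / (norms[i] * norms[j])) ** 2
    return total / (d * (d - 1))


# Two perfectly correlated features give a penalty close to 1,
# two uncorrelated features give a penalty of 0.
correlated = decorrelation_penalty([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
uncorrelated = decorrelation_penalty(
    [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
)
```

In a training loop, such a term would typically be added to the preference optimization loss with a small weight, for example `loss = preference_loss + lam * penalty`, where both names are placeholders.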

Related papers

Departures: Distributional Transport for Single-Cell Perturbation Prediction with Neural Schrödinger Bridges [51.83259180910313]
A major bottleneck in gene function analysis is the unpaired nature of single-cell data. We approximate Schrödinger Bridge (SB) to tackle unpaired single-cell perturbation data. Our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-11-17T08:27:13Z)
Towards Minimal Causal Representations for Human Multimodal Language Understanding [20.44307628909198]
We introduce a Causal Multimodal Information Bottleneck (CaMIB) model that leverages causal principles rather than traditional likelihood. To ensure global consistency of causal features, we incorporate an instrumental variable constraint. Experiments on multimodal sentiment analysis, humor detection, and sarcasm detection, along with OOD test sets, demonstrate the effectiveness of CaMIB.
arXiv Detail & Related papers (2025-09-26T03:04:23Z)
Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z)
Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
Hallucination, Monofacts, and Miscalibration: An Empirical Investigation [2.6162433502464757]
We show how different underlying data distributions affect the monofact rate and a model's tendency to hallucinate. These findings suggest that both the distribution of fact frequencies in training data and the calibration-hallucination trade-off are inherent to probabilistic language generation.
arXiv Detail & Related papers (2025-02-11T18:46:00Z)
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness [68.69369585600698]
Deep learning models often suffer from a lack of interpretability due to polysemanticity. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability. We show that monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
arXiv Detail & Related papers (2024-10-27T18:03:20Z)
Range, not Independence, Drives Modularity in Biologically Inspired Representations [52.48094670415497]
We develop a theory of when biologically inspired networks modularise their representation of source variables (sources). We derive necessary and sufficient conditions on a sample of sources that determine whether the neurons in an optimal linear autoencoder modularise. Our theory applies to any dataset, extending far beyond the case of statistical independence studied in previous work.
arXiv Detail & Related papers (2024-10-08T17:41:37Z)
MonoKAN: Certified Monotonic Kolmogorov-Arnold Network [48.623199394622546]
In certain applications, model predictions must align with expert-imposed requirements, sometimes exemplified by partial monotonicity constraints. We introduce a novel ANN architecture called MonoKAN, which is based on the KAN architecture and achieves certified partial monotonicity while enhancing interpretability. Our experiments demonstrate that MonoKAN not only enhances interpretability but also improves predictive performance across the majority of benchmarks, outperforming state-of-the-art monotonic approaches.
arXiv Detail & Related papers (2024-09-17T11:10:59Z)
InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion [53.90516061351706]
We present InterHandGen, a novel framework that learns the generative prior of two-hand interaction. For sampling, we combine anti-penetration and synthesis-free guidance to enable plausible generation. Our method significantly outperforms baseline generative models in terms of plausibility and diversity.
arXiv Detail & Related papers (2024-03-26T06:35:55Z)
Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases [76.9127853906115]
Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative. We propose Temporal Diffusion Policy Optimization with critic active neuron Reset (TDPO-R), a policy gradient algorithm that exploits the temporal inductive bias of diffusion models. Empirical results demonstrate the superior efficacy of our methods in mitigating reward overoptimization.
arXiv Detail & Related papers (2024-02-13T15:55:41Z)
Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks [10.390475063385756]
We propose a new metric to measure the monosemanticity of neurons with the guarantee of efficiency for online computation. We validate our conjecture that monosemanticity brings about performance change at different model scales.
arXiv Detail & Related papers (2023-12-17T14:42:46Z)
Curve Your Enthusiasm: Concurvity Regularization in Differentiable Generalized Additive Models [5.519653885553456]
Generalized Additive Models (GAMs) have recently experienced a resurgence in popularity due to their interpretability. We show how concurvity can severely impair the interpretability of GAMs. We propose a remedy: a conceptually simple, yet effective regularizer which penalizes pairwise correlations of the non-linearly transformed feature variables.
arXiv Detail & Related papers (2023-05-19T06:55:49Z)
On the Generalization and Adaption Performance of Causal Models [99.64022680811281]
Differentiable causal discovery has proposed to factorize the data generating process into a set of modules. We study the generalization and adaption performance of such modular neural causal models. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes.
arXiv Detail & Related papers (2022-06-09T17:12:32Z)
Modeling Implicit Bias with Fuzzy Cognitive Maps [0.0]
This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets. We introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating.
arXiv Detail & Related papers (2021-12-23T17:04:12Z)
Decomposing Natural Logic Inferences in Neural NLI [9.606462437067984]
We investigate whether neural NLI models capture the crucial semantic features central to natural logic: monotonicity and concept inclusion. We find that monotonicity information is notably weak in the representations of popular NLI models which achieve high scores on benchmarks.
arXiv Detail & Related papers (2021-12-15T17:35:30Z)
Efficient Causal Inference from Combined Observational and Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects. We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders. We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
arXiv Detail & Related papers (2021-03-08T14:29:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.