Related papers: Learning a Generative Meta-Model of LLM Activations

Learning a Generative Meta-Model of LLM Activations

URL: http://arxiv.org/abs/2602.06964v1
Date: Fri, 06 Feb 2026 18:59:56 GMT
Title: Learning a Generative Meta-Model of LLM Activations
Authors: Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt,
Abstract summary: We create "meta-models" that learn the distribution of a network's internal states.<n>Applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases.<n>These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions.
Score: 75.30161960337892
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

Related papers

Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation [1.7767466724342065]
We introduce Adrial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification based on SOTA MIA scores.<n>We show that existing methods fail in replicating a retrained model's behavior by introducing a nearest-neighbor membership inference attack (MIA-NN)<n>We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over remaining classes that a model retrained from scratch would produce.
arXiv Detail & Related papers (2025-12-07T20:57:25Z)
Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation [61.248535801314375]
Subset-Selected Counterfactual Augmentation (SS-CA)<n>We develop Counterfactual LIMA to identify minimal spatial region sets whose removal can selectively alter model predictions.<n>Experiments show that SS-CA improves generalization on in-distribution (ID) test data and achieves superior performance on out-of-distribution (OOD) benchmarks.
arXiv Detail & Related papers (2025-11-15T08:39:22Z)
Model Integrity when Unlearning with T2I Diffusion Models [11.321968363411145]
We propose approximate Machine Unlearning algorithms to reduce the generation of specific types of images, characterized by samples from a forget distribution'' We then propose unlearning algorithms that demonstrate superior effectiveness in preserving model integrity compared to existing baselines.
arXiv Detail & Related papers (2024-11-04T13:15:28Z)
Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.<n>We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.<n>Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. We show how to maximize the likelihood of a symbolic constraint w.r.t the neural network's output distribution. We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
Variational Density Propagation Continual Learning [0.0]
Deep Neural Networks (DNNs) deployed to the real world are regularly subject to out-of-distribution (OoD) data. This paper proposes a framework for adapting to data distribution drift modeled by benchmark Continual Learning datasets.
arXiv Detail & Related papers (2023-08-22T21:51:39Z)
Mechanistic Mode Connectivity [11.772935238948662]
We study neural network loss landscapes through the lens of mode connectivity. We ask the question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss?
arXiv Detail & Related papers (2022-11-15T18:58:28Z)
On the Generalization and Adaption Performance of Causal Models [99.64022680811281]
Differentiable causal discovery has proposed to factorize the data generating process into a set of modules. We study the generalization and adaption performance of such modular neural causal models. Our analysis shows that the modular neural causal models outperform other models on both zero and few-shot adaptation in low data regimes.
arXiv Detail & Related papers (2022-06-09T17:12:32Z)
Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation. We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
GAIT-prop: A biologically plausible learning rule derived from backpropagation of error [9.948484577581796]
We show an exact correspondence between backpropagation and a modified form of target propagation. In a series of simple computer vision experiments, we show near-identical performance between backpropagation and GAIT-prop.
arXiv Detail & Related papers (2020-06-11T13:52:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.