Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models
- URL: http://arxiv.org/abs/2506.00653v2
- Date: Tue, 03 Jun 2025 15:52:06 GMT
- Title: Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models
- Authors: Femi Bello, Anubrata Das, Fanzhi Zeng, Fangcong Yin, Liu Leqi
- Abstract summary: We show that representations learned across models trained on the same data can be expressed as linear combinations of a universal set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale.
- Score: 6.390475802910619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been hypothesized that neural networks with similar architectures trained on similar data learn shared representations relevant to the learning task. We build on this idea by extending the conceptual framework in which representations learned across models trained on the same data can be expressed as linear combinations of a \emph{universal} set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale. From this framework, we propose the \textbf{Linear Representation Transferability (LRT)} Hypothesis -- that there exists an affine transformation between the representation spaces of different models. To test this hypothesis, we learn affine mappings between the hidden states of models of different sizes and evaluate whether steering vectors -- directions in hidden state space associated with specific model behaviors -- retain their semantic effect when transferred from small to large language models using the learned mappings. We find strong empirical evidence that such affine mappings can preserve steering behaviors. These findings suggest that representations learned by small models can be used to guide the behavior of large models, and that the LRT hypothesis may be a promising direction for understanding representation alignment across model scales.
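As a concrete illustration of the procedure described in the abstract, the following minimal NumPy sketch fits an affine map between paired hidden states by least squares and uses it to carry a steering direction from the small model's space into the large one's. The activations are random placeholders, and the dimensions, layer choice, and behaviour label are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Minimal sketch of LRT-style transfer, with random arrays standing in for
# paired hidden states collected from a small and a large model on the same
# prompts. Dimensions and layer choice are illustrative assumptions.
rng = np.random.default_rng(0)
n_prompts, d_small, d_large = 2048, 768, 2048
H_small = rng.normal(size=(n_prompts, d_small))   # small-model hidden states
H_large = rng.normal(size=(n_prompts, d_large))   # large-model hidden states

# Fit an affine map  h_large ~ W h_small + b  by least squares.
X = np.hstack([H_small, np.ones((n_prompts, 1))])   # append a bias column
coef, *_ = np.linalg.lstsq(X, H_large, rcond=None)  # shape (d_small + 1, d_large)
W, b = coef[:-1].T, coef[-1]                        # W: (d_large, d_small)

# Transfer a steering vector, i.e. a direction in the small model's hidden space.
v_small = rng.normal(size=d_small)   # e.g. a direction tied to some behaviour
v_large = W @ v_small                # the bias drops out because a direction is a
                                     # difference of two hidden states
print(v_large.shape)                 # (2048,)
```

In practice the mapped direction would be added, possibly rescaled, to the large model's hidden states at the matching layer during generation to test whether the steered behaviour carries over.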
Related papers
- Transferring Features Across Language Models With Model Stitching [61.24716360332365]
We show that affine mappings between residual streams of language models are a cheap way to transfer represented features between models. We find that small and large models learn similar representation spaces, which motivates training expensive components such as SAEs on a smaller model and transferring them to a larger model at a FLOPs saving.
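A rough sketch of this FLOPs-saving idea, assuming paired activations and a toy ReLU sparse autoencoder; none of the shapes, weights, or layer choices below come from the paper.

```python
import numpy as np

# Fit an affine stitch that maps large-model residual-stream states into the
# small model's space, then reuse an SAE "trained" only on the small model.
rng = np.random.default_rng(1)
n, d_small, d_large, d_sae = 4096, 512, 1536, 8192
H_small = rng.normal(size=(n, d_small))   # small-model states on shared tokens
H_large = rng.normal(size=(n, d_large))   # large-model states on the same tokens

# Affine stitch: h_small ~ M h_large + c, fitted by least squares.
X = np.hstack([H_large, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, H_small, rcond=None)
M, c = coef[:-1].T, coef[-1]

# A pretend ReLU SAE encoder whose weights live in the small model's space.
W_enc = rng.normal(size=(d_sae, d_small)) / np.sqrt(d_small)

def encode(h):
    """Toy SAE encoder: ReLU of a linear map fit on small-model states."""
    return np.maximum(h @ W_enc.T, 0.0)

# SAE features for the large model, without training an SAE on it.
features_large = encode(H_large @ M.T + c)
print(features_large.shape)   # (4096, 8192)
```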
arXiv Detail & Related papers (2025-06-07T01:03:25Z) - Connecting Neural Models Latent Geometries with Relative Geodesic Representations [21.71782603770616]
We show that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. We assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric. We validate our method on model stitching and retrieval tasks, covering autoencoders and discriminative vision foundation models.
arXiv Detail & Related papers (2025-06-02T12:34:55Z) - Latent Functional Maps: a spectral framework for representation alignment [34.20582953800544]
We introduce a multi-purpose framework to the representation learning community, which makes it possible to: (i) compare different spaces in an interpretable way and measure their intrinsic similarity; (ii) find correspondences between them, both in unsupervised and weakly supervised settings; and (iii) effectively transfer representations between distinct spaces.
We validate our framework on various applications, ranging from stitching to retrieval tasks, and on multiple modalities, demonstrating that Latent Functional Maps can serve as a Swiss-army knife for representation alignment.
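The gist can be sketched with a graph-spectral toy example: build a k-NN graph over each latent space, take low-frequency Laplacian eigenvectors as bases, and express the correspondence as a small matrix acting on spectral coefficients. Everything below (identity correspondences, dimensions, basis sizes) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
Z1 = rng.normal(size=(300, 64))     # latent space of model 1
Z2 = rng.normal(size=(300, 128))    # latent space of model 2, same inputs

def laplacian_basis(Z, k_nn=10, k_eig=20):
    """Low-frequency eigenvectors of a k-NN graph Laplacian over embeddings Z."""
    G = Z @ Z.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2 * G      # squared pairwise distances
    np.fill_diagonal(D2, np.inf)
    idx = np.argsort(D2, axis=1)[:, :k_nn]
    A = np.zeros_like(D2)
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)                      # symmetrize adjacency
    L = np.diag(A.sum(1)) - A                   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k_eig]

Phi1, Phi2 = laplacian_basis(Z1), laplacian_basis(Z2)

# With identity correspondences and orthonormal bases, the functional map is
# C = Phi2^T Phi1; it sends spectral coefficients on space 1 to space 2.
C = Phi2.T @ Phi1

# Transfer a function defined on space 1 (e.g. membership in some input subset).
f1 = (rng.random(300) > 0.5).astype(float)
f2 = Phi2 @ (C @ (Phi1.T @ f1))
print(f2.shape)   # (300,)
```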
arXiv Detail & Related papers (2024-06-20T10:43:28Z) - Causal Estimation of Memorisation Profiles [58.20086589761273]
Understanding memorisation in language models has practical and societal implications.
Memorisation is the causal effect of training with an instance on the model's ability to predict that instance.
This paper proposes a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics.
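A toy version of the difference-in-differences idea, with made-up log-likelihoods rather than real model evaluations:

```python
import numpy as np

# Compare the change in per-instance log-likelihood for instances that were
# trained on ("treated") against instances that were not ("control"), before
# and after the training step(s) in question. All numbers are illustrative.
loglik = {
    ("treated", "before"): np.array([-3.1, -2.8, -3.4]),
    ("treated", "after"):  np.array([-1.2, -1.0, -1.5]),
    ("control", "before"): np.array([-3.0, -3.2, -2.9]),
    ("control", "after"):  np.array([-2.7, -2.9, -2.6]),
}

treated_change = (loglik[("treated", "after")] - loglik[("treated", "before")]).mean()
control_change = (loglik[("control", "after")] - loglik[("control", "before")]).mean()

# The control change absorbs general improvement from training on other data;
# the remainder is attributed to memorisation of the treated instances.
memorisation_effect = treated_change - control_change
print(round(memorisation_effect, 3))
```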
arXiv Detail & Related papers (2024-06-06T17:59:09Z) - On the Origins of Linear Representations in Large Language Models [51.88404605700344]
We introduce a simple latent variable model to formalize the concept dynamics of next-token prediction.
Experiments show that linear representations emerge when learning from data matching the latent variable model.
We additionally confirm some predictions of the theory using the LLaMA-2 large language model.
arXiv Detail & Related papers (2024-03-06T17:17:36Z) - Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
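A hedged sketch of the trajectory idea, not the paper's estimator: sample continuations of each text, score them under both contexts with an off-the-shelf causal LM, and compare. The model choice (GPT-2) and the symmetric score are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def continuation_logprob(context: str, continuation_ids: torch.Tensor) -> float:
    """Mean log-probability of the given continuation tokens after `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, continuation_ids], dim=1)
    logits = model(ids).logits[:, :-1]                      # predictions for next tokens
    logprobs = torch.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, ctx_ids.shape[1] - 1:].mean().item() # continuation part only

@torch.no_grad()
def sample_continuations(context: str, n: int = 8, new_tokens: int = 20):
    ctx_ids = tok(context, return_tensors="pt").input_ids
    out = model.generate(ctx_ids, do_sample=True, max_new_tokens=new_tokens,
                         num_return_sequences=n, pad_token_id=tok.eos_token_id)
    return [out[i:i + 1, ctx_ids.shape[1]:] for i in range(n)]

def trajectory_similarity(a: str, b: str) -> float:
    """Less negative means the two contexts induce more similar continuations."""
    score = 0.0
    for src, other in ((a, b), (b, a)):
        for cont in sample_continuations(src):
            score += continuation_logprob(other, cont) - continuation_logprob(src, cont)
    return score

print(trajectory_similarity("A cat sat on the mat.", "A kitten rested on the rug."))
```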
arXiv Detail & Related papers (2023-10-23T04:35:58Z) - Specify Robust Causal Representation from Mixed Observations [35.387451486213344]
Learning representations purely from observations concerns learning a low-dimensional, compact representation that benefits prediction models.
We develop a learning method to learn such representation from observational data by regularizing the learning procedure with mutual information measures.
We theoretically and empirically show that the models trained with the learned causal representations are more robust under adversarial attacks and distribution shifts.
arXiv Detail & Related papers (2023-10-21T02:18:35Z) - Flow Factorized Representation Learning [109.51947536586677]
We introduce a generative model which specifies a distinct set of latent probability paths that define different input transformations.
We show that our model achieves higher likelihoods on standard representation learning benchmarks while simultaneously being closer to approximately equivariant models.
arXiv Detail & Related papers (2023-09-22T20:15:37Z) - The Geometry of Self-supervised Learning Models and its Impact on Transfer Learning [62.601681746034956]
Self-supervised learning (SSL) has emerged as a desirable paradigm in computer vision.
We propose a data-driven geometric strategy to analyze different SSL models using local neighborhoods in the feature space induced by each.
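One simple instance of a local-neighbourhood comparison, not necessarily the statistic used in the paper: compute each sample's k nearest neighbours in both feature spaces and measure the average overlap. The features below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2, k = 500, 256, 384, 10
F1 = rng.normal(size=(n, d1))    # features from SSL model 1 on the same images
F2 = rng.normal(size=(n, d2))    # features from SSL model 2

def knn_indices(F, k):
    """Indices of each sample's k nearest neighbours (Euclidean, excluding self)."""
    G = F @ F.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2 * G
    np.fill_diagonal(D2, np.inf)
    return np.argsort(D2, axis=1)[:, :k]

nn1, nn2 = knn_indices(F1, k), knn_indices(F2, k)
overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(nn1, nn2)])
print(f"mean k-NN overlap between the two feature spaces: {overlap:.3f}")
```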
arXiv Detail & Related papers (2022-09-18T18:15:38Z) - Geometric and Topological Inference for Deep Representations of Complex Networks [13.173307471333619]
We present a class of statistics that emphasize the topology as well as the geometry of representations.
We evaluate these statistics in terms of the sensitivity and specificity that they afford when used for model selection.
These new methods enable brain and computer scientists to visualize the dynamic representational transformations learned by brains and models.
arXiv Detail & Related papers (2022-03-10T17:14:14Z) - Towards Visually Explaining Similarity Models [29.704524987493766]
We present a method to generate gradient-based visual attention for image similarity predictors.
By relying solely on the learned feature embedding, we show that our approach can be applied to any kind of CNN-based similarity architecture.
We show that our resulting attention maps serve more than just interpretability; they can be infused into the model learning process itself with new trainable constraints.
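A Grad-CAM-style sketch of gradient-based attention for a similarity model, assuming a ResNet-18 backbone with average-pooled embeddings and cosine similarity; these choices are illustrative, not the paper's architecture.

```python
import torch
import torchvision

# Take the gradient of the similarity between two image embeddings with respect
# to the last conv feature map of the query image, then weight channels by it.
backbone = torchvision.models.resnet18(weights=None).eval()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # up to last conv block
pool = torch.nn.AdaptiveAvgPool2d(1)

def embed(feats: torch.Tensor) -> torch.Tensor:
    """Average-pooled embedding from a (B, C, H, W) feature map."""
    return pool(feats).flatten(1)

x1 = torch.randn(1, 3, 224, 224)   # query image (random stand-in)
x2 = torch.randn(1, 3, 224, 224)   # reference image

feats1 = feature_extractor(x1)     # (1, C, H, W), kept in the graph for gradients
feats1.retain_grad()
with torch.no_grad():
    e2 = embed(feature_extractor(x2))

sim = torch.nn.functional.cosine_similarity(embed(feats1), e2).sum()
sim.backward()

with torch.no_grad():
    weights = feats1.grad.mean(dim=(2, 3), keepdim=True)    # per-channel importance
    attention = torch.relu((weights * feats1).sum(dim=1))   # (1, H, W) attention map
    attention = attention / (attention.max() + 1e-8)
print(attention.shape)
```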
arXiv Detail & Related papers (2020-08-13T17:47:41Z)