Transferring Features Across Language Models With Model Stitching
- URL: http://arxiv.org/abs/2506.06609v2
- Date: Mon, 23 Jun 2025 23:21:57 GMT
- Title: Transferring Features Across Language Models With Model Stitching
- Authors: Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
- Abstract summary: We show that affine mappings between the residual streams of language models are a cheap way to transfer represented features between models. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a savings in FLOPs.
- Score: 61.24716360332365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we demonstrate that affine mappings between the residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a savings in FLOPs. In particular, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground-truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently, while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
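To make the stitching recipe concrete, below is a minimal sketch of fitting the affine map between residual streams by ordinary least squares and applying it to small-model activations. All names, dimensions, and the least-squares fit are illustrative assumptions for exposition, not the paper's released code.

```python
import torch

# Hypothetical paired residual-stream activations collected at the same
# token positions in both models: shapes (n, d_small) and (n, d_large).
n, d_small, d_large = 10_000, 512, 1024
resid_small = torch.randn(n, d_small)
resid_large = torch.randn(n, d_large)

# Fit the affine map h_large ~= h_small @ W + b by ordinary least squares,
# absorbing the bias into an appended constant column.
X = torch.cat([resid_small, torch.ones(n, 1)], dim=1)   # (n, d_small + 1)
solution = torch.linalg.lstsq(X, resid_large).solution  # (d_small + 1, d_large)
W, b = solution[:-1], solution[-1]

def stitch(h_small: torch.Tensor) -> torch.Tensor:
    """Map small-model residual activations into the large model's space."""
    return h_small @ W + b
```

The fitted map also carries over interventions: because stitching is affine, adding a steering vector v in the small model corresponds to adding v @ W in the large model (the bias cancels for directions), and a transferred SAE can serve as the initialization the abstract credits with roughly 50% cheaper training.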
Related papers
- Linear Representation Transferability Hypothesis: Leveraging Small Models to Steer Large Models [6.390475802910619]
We show that representations learned across models trained on the same data can be expressed as linear combinations of a universal set of basis features. These basis features underlie the learning task itself and remain consistent across models, regardless of scale (a worked form of this claim follows below).
arXiv Detail & Related papers (2025-05-31T17:45:18Z)
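Stated as a worked equation (notation chosen here for exposition, not taken from the paper), the hypothesis says each model's hidden state is a linear image of a shared coefficient vector, from which a model-to-model linear map follows:

```latex
h_A = M_A z, \qquad h_B = M_B z
\quad\Longrightarrow\quad
h_B = \left( M_B M_A^{+} \right) h_A ,
```

where z holds the universal basis-feature coefficients and M_A^{+} is a pseudoinverse; the implication holds when M_A has full column rank, so that z is recoverable from h_A.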
- Platonic Grounding for Efficient Multimodal Language Models [22.715168904364756]
We motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. Our work also has implications for efficiently combining pretrained models into larger systems.
arXiv Detail & Related papers (2025-04-27T18:56:26Z)
- Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions [65.89403417819764]
We quantify the impact of design choices on language model capabilities. By incorporating features besides model size and number of training tokens, we achieve a relative 3-28% increase in the ability to predict downstream performance.
arXiv Detail & Related papers (2025-03-05T19:46:04Z)
- Latent Thought Models with Variational Bayes Inference-Time Computation [52.63299874322121]
Latent Thought Models (LTMs) incorporate explicit latent thought vectors that follow an explicit prior model in latent space. LTMs demonstrate superior sample and parameter efficiency compared to autoregressive models and discrete diffusion models.
arXiv Detail & Related papers (2025-02-03T17:50:34Z)
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., a value network). After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes (a rough sketch of the mechanism follows below).
arXiv Detail & Related papers (2024-10-28T13:48:43Z)
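One plausible reading of this mechanism is sketched below; the module name, its architecture, and the choice to condition on hidden states rather than raw logits are all assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Hypothetical logit-delta module: trained once against a small base
    model, then reused to adjust a compatible model's next-token logits."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)  # additive correction to the logits

def post_trained_logits(base_logits: torch.Tensor,
                        hidden: torch.Tensor,
                        value_net: ValueNetwork) -> torch.Tensor:
    # Post-training as a separable delta: frozen base model + learned change.
    return base_logits + value_net(hidden)
```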
- LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With proper strategies, evaluated across different benchmarks, even a small 2.7B model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z)
- Transferring Knowledge from Large Foundation Models to Small Downstream Models [40.38657103236168]
We introduce Adaptive Feature Transfer (AFT) to transfer knowledge from large pre-trained models to small downstream models.
AFT operates purely on features, decoupling the choice of the pre-trained model from the smaller downstream model.
AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost (an illustrative feature-level objective follows below).
arXiv Detail & Related papers (2024-06-11T15:06:15Z)
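An illustrative feature-level transfer objective in this spirit (not AFT's exact formulation; the dimensions and names are made up):

```python
import torch
import torch.nn as nn

d_foundation, d_small = 2048, 512        # illustrative feature dimensions
proj = nn.Linear(d_foundation, d_small)  # learned map between feature spaces

def feature_transfer_loss(f_foundation: torch.Tensor,
                          f_small: torch.Tensor) -> torch.Tensor:
    # Pull the downstream model's features toward a linear image of the
    # frozen foundation model's features; added to the usual task loss.
    return nn.functional.mse_loss(f_small, proj(f_foundation.detach()))
```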
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder. DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z)
- Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning [39.51220489287151]
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). We propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models. Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss.
arXiv Detail & Related papers (2023-12-28T15:02:03Z)
- Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts [22.74552390076515]
We probe the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones and pretraining sets.
We detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-transformer models.
We find the existence of outlier features to be an indication of ImageNet shift robustness, since in our analysis they appear only in robust models (one simple detection criterion is sketched below).
arXiv Detail & Related papers (2023-10-19T17:59:12Z)
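One simple way to operationalize "outlier features" is to flag activation dimensions whose typical magnitude dwarfs the rest. The criterion and threshold below are illustrative choices, not the paper's exact definition:

```python
import torch

def outlier_dims(acts: torch.Tensor, ratio: float = 6.0) -> torch.Tensor:
    """Flag dimensions whose mean |activation| is far above the median
    dimension. acts: (n_samples, d_model) encoder activations."""
    mag = acts.abs().mean(dim=0)  # per-dimension mean magnitude
    return torch.nonzero(mag > ratio * mag.median()).squeeze(-1)
```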
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation [28.432799973328127]
We propose Homomorphic Projective Distillation (HPD) to learn compressed sentence embeddings.
Our method augments a small Transformer encoder model with learnable projection layers to produce compact representations (a sketch of this setup follows below).
arXiv Detail & Related papers (2022-03-15T07:05:43Z)
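A projection-based distillation setup in this spirit might look as follows (a sketch under assumed dimensions; HPD's actual homomorphic construction is not reproduced here):

```python
import torch
import torch.nn as nn

d_student, d_teacher, d_compact = 384, 1024, 128  # illustrative sizes

# Learnable projection heads map both encoders into a shared compact space.
student_proj = nn.Linear(d_student, d_compact)
teacher_proj = nn.Linear(d_teacher, d_compact)

def distill_loss(student_feat: torch.Tensor,
                 teacher_feat: torch.Tensor) -> torch.Tensor:
    # Train the small encoder (plus its projection) so its compact
    # embedding matches a projection of the frozen teacher's embedding.
    return nn.functional.mse_loss(student_proj(student_feat),
                                  teacher_proj(teacher_feat).detach())
```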