Training Superior Sparse Autoencoders for Instruct Models
- URL: http://arxiv.org/abs/2506.07691v1
- Date: Mon, 09 Jun 2025 12:23:34 GMT
- Title: Training Superior Sparse Autoencoders for Instruct Models
- Authors: Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
- Abstract summary: We propose a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$.
- Score: 16.3663776969074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, $\textit{FAST}$ achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$. Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.
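For readers new to SAEs, the following is a minimal sketch of the kind of sparse autoencoder the abstract describes: a ReLU encoder into an overcomplete feature space plus a linear decoder, trained to minimize reconstruction MSE with an L1 sparsity penalty. The architecture, dimensions, and penalty weight here are generic illustrations, not the FAST recipe itself.

```python
# Minimal sparse-autoencoder (SAE) sketch in PyTorch. All hyperparameters
# below are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # f: sparse feature activations; x_hat: reconstruction of the input.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction MSE (the metric the abstract reports) plus an L1
    # sparsity penalty on the feature activations.
    mse = (x_hat - x).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Toy usage: a batch of 4096-dim residual-stream activations.
sae = SparseAutoencoder(d_model=4096, d_features=32768)
x = torch.randn(64, 4096)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```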
Related papers
- Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models [3.207886496235499]
We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain", in which models learn to solve new problems that they previously could not solve even at high $k$ (a pass@$k$ sketch follows below).
arXiv Detail & Related papers (2025-06-16T19:03:06Z)
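The "compressing pass@$k$ into pass@1" claim above refers to the standard pass@$k$ metric. Below is the usual unbiased estimator (popularized by the Codex paper); the sample counts in the usage example are illustrative.

```python
# Unbiased pass@k estimator: given n samples per problem of which c are
# correct, estimate the probability that at least one of k draws is correct.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# A model that solves a problem in 30 of 100 samples:
print(pass_at_k(100, 30, 1))   # 0.30 (pass@1)
print(pass_at_k(100, 30, 10))  # ~0.97 (pass@10)
```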
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data (a sketch of such a verify-and-correct loop follows below).
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
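A minimal sketch of the verify-then-correct inference loop described above. The `generate` and `verify` callables are hypothetical stand-ins for LLM calls (here `verify` returns "correct" or a textual critique); the RL training that instills these behaviors in a single model is not shown.

```python
def solve_with_self_correction(problem: str, generate, verify, max_rounds: int = 3) -> str:
    answer = generate(problem)
    for _ in range(max_rounds):
        verdict = verify(problem, answer)  # the model critiques its own answer
        if verdict == "correct":
            break
        # Revise, conditioning on the critique of the previous attempt.
        answer = generate(f"{problem}\nCritique of previous attempt: {verdict}\nRevise.")
    return answer


# Toy usage with stub callables:
print(solve_with_self_correction("2 + 2 = ?", generate=lambda p: "4",
                                 verify=lambda p, a: "correct"))
```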
- The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks. We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities. Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability (a minimal TTT loop follows below).
arXiv Detail & Related papers (2024-11-11T18:59:45Z)
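A test-time-training loop in miniature, under simplifying assumptions: a tiny regression model stands in for an LM, and the paper's adapter-based and leave-one-out details are omitted.

```python
# Test-time training (TTT): before answering a novel task, take a few
# gradient steps on its demonstration pairs, then predict with the adapted copy.
import copy
import torch
import torch.nn as nn


def test_time_train(model: nn.Module, demos, steps: int = 20, lr: float = 1e-2):
    adapted = copy.deepcopy(model)  # never mutate the base model
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in demos:
            opt.zero_grad()
            loss = nn.functional.mse_loss(adapted(x), y)
            loss.backward()
            opt.step()
    return adapted


# Toy usage: adapt a linear model to a new task's few-shot examples.
base = nn.Linear(4, 1)
demos = [(torch.randn(4), torch.randn(1)) for _ in range(3)]
adapted = test_time_train(base, demos)
```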
- Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models (an activation-collection sketch follows below).
arXiv Detail & Related papers (2024-10-27T17:33:49Z)
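Training per-layer SAEs, as Llama Scope does, requires the residual-stream activations of every layer. One common way to collect them is PyTorch forward hooks; the `model.model.layers` path below assumes the Hugging Face Llama layout, and model loading is elided. This is an illustrative sketch, not the paper's pipeline.

```python
import torch

activations: dict[int, torch.Tensor] = {}


def make_hook(layer_idx: int):
    def hook(module, inputs, output):
        # For HF decoder blocks, output[0] is the hidden state (residual stream).
        hidden = output[0] if isinstance(output, tuple) else output
        activations[layer_idx] = hidden.detach().to("cpu")
    return hook


def register_hooks(model):
    handles = []
    for i, block in enumerate(model.model.layers):  # LlamaForCausalLM layout
        handles.append(block.register_forward_hook(make_hook(i)))
    return handles  # call h.remove() on each handle when done
```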
- Aligning Large Language Models via Self-Steering Optimization [78.42826116686435]
We introduce Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference signals.
$SSO$ maintains the accuracy of signals by ensuring a consistent gap between chosen and rejected responses.
We validate the effectiveness of $SSO$ with two foundation models, Qwen2 and Llama3.1, indicating that it provides accurate, on-policy preference signals.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
- How to Leverage Demonstration Data in Alignment for Large Language Model? A Self-Imitation Learning Perspective [17.956310574300765]
This paper introduces a novel generalized self-imitation learning ($\textbf{GSIL}$) framework.
It effectively and efficiently aligns large language models with offline demonstration data.
$\textbf{GSIL}$ consistently and significantly outperforms baselines in many challenging benchmarks.
arXiv Detail & Related papers (2024-10-14T02:21:29Z)
- Aligning Model Properties via Conformal Risk Control [4.710921988115686]
Post-training alignment via human feedback shows promise, but is often limited to generative AI settings.
In traditional non-generative settings with numerical or categorical outputs, detecting misalignment through single-sample outputs remains challenging.
We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions (formalized below).
arXiv Detail & Related papers (2024-06-26T22:24:46Z)
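Read as a property test, the alignment definition above can be formalized roughly as follows; the distance $d$ and tolerance $\epsilon$ follow generic property-testing conventions and are not necessarily the paper's exact notation.

```latex
% Alignment as property testing (generic formulation):
% f is aligned iff it belongs to the property class \mathcal{P}; a tester
% should accept members and reject functions that are far from \mathcal{P}.
\[
  f \text{ is aligned} \iff f \in \mathcal{P},
  \qquad
  d(f, \mathcal{P}) := \inf_{g \in \mathcal{P}} d(f, g) > \epsilon
  \;\Longrightarrow\; \text{reject } f \text{ with high probability.}
\]
```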
- Adding Conditional Control to Diffusion Models with Reinforcement Learning [68.06591097066811]
Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes. This work presents a novel method based on reinforcement learning (RL) to add such controls using an offline dataset.
arXiv Detail & Related papers (2024-06-17T22:00:26Z)
- Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models (a logit-steering sketch follows below).
arXiv Detail & Related papers (2024-05-29T16:55:32Z)
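The sketch below shows a proxy-style steered greedy decoding step in the spirit of weak-to-strong search: the large model's next-token logits are shifted by the tuned-minus-untuned delta of a small model pair. It illustrates the idea, not the paper's exact search procedure; the three logit vectors are assumed unbatched and over a shared vocabulary.

```python
import torch


def guided_greedy_step(logits_large, logits_small_tuned, logits_small_untuned,
                       beta: float = 1.0):
    # Shift the large model's logits by the small models' "alignment delta".
    steered = logits_large + beta * (logits_small_tuned - logits_small_untuned)
    return torch.argmax(steered, dim=-1)  # greedy next-token choice


# Toy usage with random logits over a 10-token vocabulary:
v = 10
tok = guided_greedy_step(torch.randn(v), torch.randn(v), torch.randn(v), beta=0.5)
```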
- Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training [42.89066583603415]
This work identifies three critical $\textit{O}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines.
We show that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance (a stacking sketch follows below).
arXiv Detail & Related papers (2024-05-24T08:00:00Z)
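A depthwise stacking operator in the spirit of $G_{\text{stack}}$: duplicate a trained block stack along depth, then continue pre-training the grown model. The 2x growth factor and deepcopy-based duplication are illustrative assumptions.

```python
import copy
import torch.nn as nn


def stack_depthwise(layers: nn.ModuleList, growth: int = 2) -> nn.ModuleList:
    grown = []
    for _ in range(growth):
        for layer in layers:
            grown.append(copy.deepcopy(layer))  # reuse trained weights
    return nn.ModuleList(grown)


# Toy usage: grow a 4-block stack into an 8-block stack.
small = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
large = stack_depthwise(small)
assert len(large) == 8
```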
- Foundation Model's Embedded Representations May Detect Distribution Shift [0.0]
We present a case study for transfer learning tasks on the Sentiment140 dataset.
We show that many pre-trained foundation models encode different representations of Sentiment140's manually curated test set $M$ from the automatically labeled training set $P$.
We argue that training on $P$ and measuring performance on $M$ is a biased measure of generalization (a split-classifier sketch follows below).
arXiv Detail & Related papers (2023-10-20T22:20:50Z)
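One simple way to test whether a model embeds two splits differently, as reported above for Sentiment140's train/test sets: train a classifier to tell the splits apart from embeddings alone; accuracy far above chance indicates a representational shift. The embeddings below are synthetic stand-ins for model outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
emb_train = rng.normal(0.0, 1.0, size=(500, 64))  # embeddings of split P
emb_test = rng.normal(0.5, 1.0, size=(500, 64))   # embeddings of split M, shifted

X = np.vstack([emb_train, emb_test])
y = np.array([0] * 500 + [1] * 500)
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"split-classifier accuracy: {acc:.2f}")  # well above 0.5 => shift
```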
- CLAWSAT: Towards Both Robust and Accurate Code Models [74.57590254102311]
We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models.
To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models.
arXiv Detail & Related papers (2022-11-21T18:32:50Z)
- Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space [51.62131362670815]
This paper addresses the problem of ranking the pre-trained deep neural networks and screening the most transferable ones for downstream tasks.
It proposes a new transferability metric called Self-challenging Fisher Discriminant Analysis (SFDA); a generic Fisher-score sketch follows below.
arXiv Detail & Related papers (2022-07-07T01:33:25Z)
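A generic Fisher-discriminant transferability score in the spirit of SFDA: features whose between-class scatter dominates within-class scatter should transfer better to the downstream task. The actual metric's self-challenging regularization is omitted here.

```python
import numpy as np


def fisher_score(features: np.ndarray, labels: np.ndarray) -> float:
    # Ratio of between-class to within-class scatter (higher = more separable).
    mu = features.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        between += len(fc) * np.sum((mu_c - mu) ** 2)
        within += np.sum((fc - mu_c) ** 2)
    return between / (within + 1e-12)


# Toy usage: well-separated features score higher than unseparated ones.
rng = np.random.default_rng(0)
f_good = np.vstack([rng.normal(i * 3, 1, (50, 8)) for i in range(3)])
f_bad = np.vstack([rng.normal(0, 1, (50, 8)) for _ in range(3)])
y = np.repeat([0, 1, 2], 50)
print(fisher_score(f_good, y) > fisher_score(f_bad, y))  # True
```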
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)