Teach Old SAEs New Domain Tricks with Boosting
- URL: http://arxiv.org/abs/2507.12990v1
- Date: Thu, 17 Jul 2025 10:57:49 GMT
- Title: Teach Old SAEs New Domain Tricks with Boosting
- Authors: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
- Abstract summary: This paper introduces a residual learning approach that addresses the feature blindness of pretrained Sparse Autoencoders (SAEs) without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics.
- Score: 3.3865605512957453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
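Below is a minimal sketch of the residual-learning recipe described in the abstract, assuming a standard ReLU SAE with an L1 sparsity penalty, written in PyTorch. The class and function names, the hyperparameters, and the choice to feed the secondary SAE the original activation (rather than the primary's residual) are illustrative assumptions, not the authors' released code.

```python
# Sketch of residual SAE "boosting": a frozen primary SAE plus a secondary SAE
# trained to reconstruct the primary's error on domain-specific activations.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Standard ReLU SAE: f = ReLU(W_enc x + b_enc), x_hat = W_dec f + b_dec."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        f = torch.relu(self.encoder(x))        # sparse feature activations
        return self.decoder(f), f              # reconstruction and features


def train_residual_sae(primary: SparseAutoencoder,
                       secondary: SparseAutoencoder,
                       domain_acts: torch.Tensor,   # [N, d_model] activations from domain texts
                       steps: int = 10_000,
                       batch_size: int = 256,
                       l1_coeff: float = 1e-3,
                       lr: float = 1e-4) -> None:
    """Fit the secondary SAE to the primary SAE's reconstruction error; the primary stays frozen."""
    primary.eval()
    opt = torch.optim.Adam(secondary.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, domain_acts.size(0), (batch_size,))
        x = domain_acts[idx]
        with torch.no_grad():                  # no gradients ever reach the primary SAE
            x_hat_primary, _ = primary(x)
        residual = x - x_hat_primary           # what the primary SAE misses on this domain
        r_hat, f = secondary(x)                # secondary reads x, targets the residual
        loss = (r_hat - residual).pow(2).mean() + l1_coeff * f.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()


@torch.no_grad()
def boosted_reconstruction(primary: SparseAutoencoder,
                           secondary: SparseAutoencoder,
                           x: torch.Tensor) -> torch.Tensor:
    """Inference: sum the outputs of both models, as described in the abstract."""
    x_hat_primary, _ = primary(x)
    r_hat, _ = secondary(x)
    return x_hat_primary + r_hat
```

Because the primary SAE is frozen and the secondary only models the leftover error, the original SAE's general-domain features and reconstructions are preserved, while the boosted sum absorbs the domain-specific structure it misses.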
Related papers
- Addressing Imbalanced Domain-Incremental Learning through Dual-Balance Collaborative Experts [59.615381619866284]
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. We introduce the Dual-Balance Collaborative Experts (DCE) framework to overcome these challenges.
arXiv Detail & Related papers (2025-07-09T17:57:07Z) - CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass [3.0566617373924325]
Recent advances in pre-trained language models (PLMs) have driven remarkable progress in sentence representation learning. We propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. We show that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption.
arXiv Detail & Related papers (2025-05-01T08:27:14Z) - Adapting In-Domain Few-Shot Segmentation to New Domains without Retraining [53.963279865355105]
Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques. We propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples.
arXiv Detail & Related papers (2025-04-30T08:16:33Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the Chain-of-Action-Thought (COAT) reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling [5.6987175375687995]
We introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). Our method aims to enhance CLIP's zero-shot capabilities by incorporating new domain knowledge while improving its robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention to optimize CLIP's adaptability while safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts for the seamless integration of new domain insights without disrupting the original model's representation; and an ensemble learning strategy that effectively merges original and new knowledge.
arXiv Detail & Related papers (2024-12-10T00:40:31Z) - Unveiling the Vulnerability of Private Fine-Tuning in Split-Based Frameworks for Large Language Models: A Bidirectionally Enhanced Attack [20.727726850786386]
We propose BiSR, the first data reconstruction attack (DRA) designed to target both the forward and backward propagation processes of split learning (SL).
arXiv Detail & Related papers (2024-09-02T06:01:20Z) - Investigating Continual Pretraining in Large Language Models: Insights and Implications [9.660013084324817]
Continual learning in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies. We introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. Our findings uncover several key insights: (i) continual pretraining consistently improves the 1.5B models studied in this work and is also superior to domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, and (iii) smaller models are particularly sensitive to continual pretraining, showing the most significant rates of both learning and forgetting.
arXiv Detail & Related papers (2024-02-27T10:47:24Z) - FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning [54.9235160379917]
FusDom is a simple and novel methodology for SSL-based continued pre-training.
FusDom learns speech representations that are robust and adaptive yet not forgetful of concepts seen in the past.
arXiv Detail & Related papers (2023-12-20T13:50:05Z) - Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE.
This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way (see the sketch after this list).
arXiv Detail & Related papers (2023-11-16T07:09:38Z) - Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in the presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
arXiv Detail & Related papers (2021-03-01T21:14:33Z)
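The three-step extractor described in the DOKE entry above lends itself to a short, hedged sketch. Everything below, including the keyword-overlap selector and the prompt template, is an illustrative assumption; only the prepare/select/express decomposition comes from that abstract.

```python
# Hedged sketch of a DOKE-style knowledge plugin: prepare domain knowledge once,
# select the items relevant to each sample, and express them as plain text for
# the LLM. The selector and the template are placeholders, not the paper's method.
from dataclasses import dataclass


@dataclass
class KnowledgeItem:
    text: str
    keywords: set[str]


def prepare_knowledge(corpus: list[str]) -> list[KnowledgeItem]:
    """Step 1: turn a raw domain corpus into reusable knowledge items."""
    return [KnowledgeItem(text=doc, keywords=set(doc.lower().split()))
            for doc in corpus]


def select_knowledge(sample: str, items: list[KnowledgeItem], k: int = 3) -> list[KnowledgeItem]:
    """Step 2: pick the k items most relevant to this sample (keyword overlap as a stand-in)."""
    words = set(sample.lower().split())
    return sorted(items, key=lambda it: len(words & it.keywords), reverse=True)[:k]


def express_knowledge(sample: str, selected: list[KnowledgeItem]) -> str:
    """Step 3: render the selected knowledge as text the LLM can condition on."""
    facts = "\n".join(f"- {it.text}" for it in selected)
    return f"Relevant domain knowledge:\n{facts}\n\nTask input: {sample}"
```

A real system would replace the keyword overlap with whatever retrieval or knowledge-graph machinery the task requires; the sketch only illustrates the prepare/select/express pipeline.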
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.