From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
Abstract Overview
This paper proposes Interpretability-Guided Data Selection (IGDS), a framework that leverages Sparse Autoencoders (SAEs) to identify causally validated task-specific features within large language models and uses them to guide training data selection for fine-tuning. The framework operates in two stages: first, it identifies task features through high-frequency recall followed by interventional filtering for causal validation; second, it scores candidate training examples by their Feature-Resonant Score, which aggregates the activation magnitudes of the validated task features. IGDS is evaluated on mathematical reasoning, summarization, and machine translation tasks across the Gemma-2-2B, LLaMA-3.1-8B, and Qwen3-8B models. Using only 50% of the training data, IGDS outperforms full-dataset fine-tuning in all nine model-task configurations and consistently surpasses the compared data selection baselines.
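The second stage can be illustrated with a minimal sketch. It is not the paper's implementation: the summary only states that the Feature-Resonant Score aggregates activation magnitudes of validated task features, so the specific aggregation used here (mean over tokens, then sum over features) is an assumption, and the function names `feature_resonant_score` and `select_top_fraction` are hypothetical.

```python
import numpy as np

def feature_resonant_score(activations, task_features):
    """Score one training example from its SAE activations.

    activations: array of shape (num_tokens, num_sae_features).
    task_features: indices of the causally validated task features.
    Assumption: aggregate by averaging over tokens, then summing over
    the validated features. The paper may aggregate differently.
    """
    acts = np.asarray(activations, dtype=float)
    selected = acts[:, list(task_features)]
    return float(selected.mean(axis=0).sum())

def select_top_fraction(scores, fraction=0.5):
    """Return the (sorted) indices of the highest-scoring examples."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(len(scores) * fraction)))
    order = np.argsort(-scores, kind="stable")  # descending by score
    return sorted(order[:k].tolist())

# Toy usage: 2 tokens, 3 SAE features, validated features are 0 and 2.
example = [[1.0, 0.0, 2.0],
           [3.0, 0.0, 0.0]]
score = feature_resonant_score(example, task_features=[0, 2])
kept = select_top_fraction([3.0, 0.0, 1.0, 2.0], fraction=0.5)
```

With `fraction=0.5`, the selection step keeps the half of the candidate pool whose examples most strongly activate the validated features, matching the 50% budget used in the reported experiments.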
Novelty
The main novelty is using causally validated internal model features, extracted via SAEs and confirmed through targeted activation amplification, as the basis for training data selection rather than relying on external quality or diversity heuristics. The work is also distinctive in explicitly connecting mechanistic interpretability analysis to downstream optimization through a two-stage pipeline of causal feature identification and feature-based data scoring.
Results
Across all nine model-task configurations, IGDS outperforms the compared data selection baselines (Random, Loss, IFD, and ZIP) while using only 50% of the data. It also exceeds full-dataset fine-tuning performance in all settings; the largest reported gain is a 17.4% relative improvement on Gemma-2-2B for the Math task. Ablation studies confirm that both high-frequency recall and causal filtering are necessary components, and stability analysis shows that the top-ranked task features persist across different identification source datasets.
Key Points
- IGDS identifies task-relevant features through a coarse-to-fine process: high-frequency recall narrows the vast SAE feature space to a small candidate set (often only a few hundredths of a percent of all features), and causal validation through targeted feature amplification then filters that set.
- Training data are ranked by a Feature-Resonant Score aggregating activation of validated task features, yielding data-efficient fine-tuning that surpasses full-dataset training with only 50% of the data across all tested configurations.
- Stability analysis demonstrates that identified task features (e.g., the top Math feature F_14,11575 for Gemma-2-2B) are consistent across different source datasets, and ablation studies confirm both pipeline stages are necessary for optimal performance.
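
The coarse-to-fine feature identification described in the first key point can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: a feature is treated as "active" when its activation is positive, the recall step keeps the `top_k` most frequently active features, and the causal step relies on a hypothetical caller-supplied function `amplification_effect` that returns the change in a task metric when a feature is amplified. How the paper measures that effect is not specified in this summary.

```python
import numpy as np

def recall_frequent_features(example_activations, top_k):
    """Stage 1a: keep the features that fire most often on task data.

    example_activations: array of shape (num_examples, num_sae_features),
    e.g. the per-example maximum activation of each feature.
    Assumption: a feature counts as active when its activation is > 0.
    """
    acts = np.asarray(example_activations, dtype=float)
    frequency = (acts > 0).mean(axis=0)  # fraction of examples where active
    order = np.argsort(-frequency, kind="stable")
    return order[:top_k].tolist()

def filter_causal_features(candidates, amplification_effect, min_effect=0.0):
    """Stage 1b: keep candidates whose amplification helps the task.

    amplification_effect: hypothetical callable mapping a feature index
    to the measured change in task performance under amplification.
    """
    return [f for f in candidates if amplification_effect(f) > min_effect]

# Toy usage: 3 task examples, 4 SAE features.
acts = [[1.0, 0.0, 0.5, 0.0],
        [2.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.3, 0.1]]
candidates = recall_frequent_features(acts, top_k=2)
measured = {0: 0.2, 2: -0.1}  # made-up effects, for illustration only
validated = filter_causal_features(candidates, lambda f: measured.get(f, 0.0))
```

Separating the two steps keeps the expensive interventional check confined to the small candidate set produced by the frequency filter.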