FuguReport

From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

Authors: Ling Shi, Xinwei Wu, Xiaohu Zhao, Hao Wang, Heng Liu, Yangyang Liu, Linlong Xu, Longyue Wang, Deyi Xiong, Weihua Luo
Affiliations: Tianjin University / Alibaba
Categories: Method / Data Selection / Causal task feature identification, Evaluation / Model Evaluation / Evaluation on multiple LLMs, Application / NLP Tasks / Mathematical reasoning and translation
License: CC BY 4.0

Abstract Overview

This paper proposes Interpretability-Guided Data Selection (IGDS), a framework that leverages Sparse Autoencoders (SAEs) to identify causally validated task-specific features within large language models and uses them to guide training data selection for fine-tuning. The framework operates in two stages: first, it identifies task features through high-frequency recall followed by interventional filtering for causal validation; second, it scores candidate training examples by their Feature-Resonant Score, which aggregates activation magnitudes of validated task features. IGDS is evaluated on mathematical reasoning, summarization, and machine translation tasks across Gemma-2-2B, LLaMA-3.1-8B, and Qwen3-8B models. Using only 50% of the training data, IGDS outperforms full-dataset fine-tuning in all nine model-task configurations and consistently surpasses compared data selection baselines.
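
The second stage can be illustrated with a minimal sketch. The summary above only says that the Feature-Resonant Score aggregates activation magnitudes of validated task features; the specific aggregation used here (sum over task features, mean over tokens), the function names, and the toy activations are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def feature_resonant_score(activations, task_features):
    """Score one training example from its SAE activations.

    activations: array of shape (num_tokens, num_sae_features).
    task_features: indices of the validated task features.
    Assumed aggregation: sum over task features, then mean over tokens.
    """
    acts = np.asarray(activations, dtype=float)
    return float(acts[:, task_features].sum(axis=1).mean())

def select_top_fraction(scores, fraction=0.5):
    """Return indices of the highest-scoring examples, in original order."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(len(scores) * fraction)))
    ranked = np.argsort(-scores, kind="stable")  # descending by score
    return sorted(ranked[:k].tolist())

# Toy SAE activations for four candidate examples (hypothetical values).
examples = [
    np.array([[0.0, 2.0, 0.0, 1.0], [0.0, 4.0, 0.0, 1.0]]),
    np.array([[1.0, 0.0, 5.0, 0.0]]),
    np.array([[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 2.0]]),
    np.array([[3.0, 0.0, 0.0, 0.5]]),
]
task_features = [1, 3]  # pretend these passed causal validation

scores = [feature_resonant_score(x, task_features) for x in examples]
selected = select_top_fraction(scores, fraction=0.5)
print(scores)    # [4.0, 0.0, 2.0, 0.5]
print(selected)  # [0, 2]
```

The second example activates feature 2 strongly but scores zero, because only the validated task features contribute to the score.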

Novelty

The main novelty is using causally validated internal model features, extracted via SAEs and confirmed through targeted activation amplification, as the basis for training data selection rather than relying on external quality or diversity heuristics. The work is also distinctive in explicitly connecting mechanistic interpretability analysis to downstream optimization through a two-stage pipeline of causal feature identification and feature-based data scoring.
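
The summary does not say how the amplification is applied. One common way to intervene on a single SAE feature, assumed here, is to encode the hidden state, scale one latent, decode, and add back the SAE's reconstruction error so that only the targeted feature changes. The ReLU encoder, the weight names, and the toy values below are hypothetical.

```python
import numpy as np

def amplify_feature(hidden, W_enc, b_enc, W_dec, b_dec, feature_id, scale):
    """Return the hidden state with one SAE feature scaled by `scale`."""
    hidden = np.asarray(hidden, dtype=float)
    latents = np.maximum(hidden @ W_enc + b_enc, 0.0)  # assumed ReLU encoder
    error = hidden - (latents @ W_dec + b_dec)         # part the SAE misses
    latents = latents.copy()
    latents[..., feature_id] *= scale                  # targeted amplification
    return latents @ W_dec + b_dec + error

# Toy SAE: 2-dimensional hidden state, 3 features (hypothetical weights).
W_enc = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
b_enc = np.zeros(3)
W_dec = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
b_dec = np.zeros(2)

hidden = np.array([1.0, 2.0])
patched = amplify_feature(hidden, W_enc, b_enc, W_dec, b_dec, feature_id=1, scale=3.0)
unchanged = amplify_feature(hidden, W_enc, b_enc, W_dec, b_dec, feature_id=1, scale=1.0)
print(patched)    # [1. 6.]
print(unchanged)  # [1. 2.]
```

With `scale=1.0` the hidden state is returned unchanged, which is a useful sanity check before measuring how amplification shifts task performance.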

Results

Across all nine model-task configurations, IGDS outperforms the compared data selection baselines (Random, Loss, IFD, and ZIP) while using only 50% of the data. It also exceeds full-dataset fine-tuning in every setting; the largest reported gain is a 17.4% relative improvement on Gemma-2-2B for the Math task. Ablation studies confirm that both frequency recall and causal filtering are necessary components, and stability analysis shows that the top-ranked task features persist across different source datasets used for identification.

Key Points

  1. IGDS identifies task-relevant features through a coarse-to-fine process: high-frequency recall narrows the vast SAE feature space to a small candidate set (often only a few hundredths of a percent of all features), and causal validation through targeted feature amplification then filters that set.
  2. Training data are ranked by a Feature-Resonant Score aggregating activation of validated task features, yielding data-efficient fine-tuning that surpasses full-dataset training with only 50% of the data across all tested configurations.
  3. Stability analysis demonstrates that identified task features (e.g., the top Math feature F_14,11575 for Gemma-2-2B) are consistent across different source datasets, and ablation studies confirm both pipeline stages are necessary for optimal performance.
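
The coarse-to-fine identification in point 1 can be sketched as follows. This is a toy illustration under stated assumptions: the activation threshold, the `top_k` cutoff, the gain criterion, and the metric values are hypothetical, and in practice the amplified metric would come from running the model with the feature amplified.

```python
import numpy as np

def recall_frequent_features(activations, top_k, threshold=0.0):
    """Stage 1a: keep the top_k features that fire most often on task data.

    activations: array of shape (num_tokens, num_sae_features).
    """
    acts = np.asarray(activations, dtype=float)
    freq = (acts > threshold).mean(axis=0)     # firing rate per feature
    ranked = np.argsort(-freq, kind="stable")  # most frequent first
    return ranked[:top_k].tolist()

def causal_filter(candidates, metric_when_amplified, baseline, min_gain=0.0):
    """Stage 1b: keep candidates whose amplification improves the task metric."""
    return [f for f in candidates
            if metric_when_amplified(f) - baseline > min_gain]

# Toy activations on task inputs: 4 tokens, 5 SAE features.
acts = np.array([
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 3.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0, 0.0],
])
candidates = recall_frequent_features(acts, top_k=3)

# Hypothetical task metric measured with each candidate amplified.
amplified_metric = {0: 0.40, 1: 0.38, 2: 0.55}
task_features = causal_filter(candidates, amplified_metric.get, baseline=0.40)
print(candidates)     # [0, 2, 1]
print(task_features)  # [2]
```

Feature 0 fires on every token but does not move the metric when amplified, so frequency alone would keep it while the causal filter removes it. That is one way to read why the ablations find both stages necessary.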

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.