Fugu-MT paper translation (abstract): Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

Paper overview: Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients

arxiv url: http://arxiv.org/abs/2603.14665v1
Date: Sun, 15 Mar 2026 23:39:36 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.949127
Title: Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients
Title (reference translation): Gradient Atoms: Discovery, Attribution, and Steering of Model Behaviors via Sparse Decomposition of Training Gradients
Authors: J Rosser
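Abstract summary: Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works. We propose Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components.
Reference score (independently computed attention score): 0.0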
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works: models often learn broad concepts shared across many examples. Existing TDA methods are supervised -- they require a query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal, arithmetic, yes/no classification, trivia QA -- without any behavioral labels. These atoms double as effective steering vectors: applying them as weight-space perturbations produces large, controllable shifts in model behavior (e.g., bulleted-list generation 33% to 94%; systematic refusal 50% to 0%). The method requires no query-document scoring stage, and scales independently of the number of query behaviors of interest. Code is here: https://github.com/jrosseruk/gradient_atoms
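Abstract (reference translation): Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. We argue that this per-document framing is fundamentally mismatched to how fine-tuning actually works. Existing TDA methods are supervised: they require a query behavior and then score every training document against it. This paper proposes Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") through dictionary learning in a preconditioned eigenspace. Among the 500 discovered atoms, those with the highest coherence recover interpretable task-type behaviors (refusal, arithmetic, yes/no classification, trivia QA) without any behavioral labels. These atoms also serve as effective steering vectors: applied as weight-space perturbations, they produce large, controllable shifts in model behavior (for example, bulleted-list generation from 33% to 94%, and systematic refusal from 50% to 0%). The method requires no query-document scoring stage and scales independently of the number of query behaviors of interest. The code is available here: https://github.com/jrosseruk/gradient_atoms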
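
The pipeline the abstract describes (precondition per-document gradients, learn a sparse dictionary, rank atoms by coherence, steer with an atom) can be sketched on synthetic data. This is only an illustration under stated assumptions, not the author's implementation: the preconditioner here is the gradients' own second-moment matrix, the dictionary learner is a plain alternating scheme with hard thresholding, "coherence" is taken to be mean pairwise cosine similarity of an atom's top-loading documents, and every function name (`precondition`, `learn_atoms`, `coherence`, `steer`) and parameter value is hypothetical.

```python
import numpy as np

def precondition(G, rank, damping=1e-3):
    """Project gradients onto the top-`rank` eigenvectors of their second-moment
    matrix and rescale each direction by 1/sqrt(eigenvalue + damping).
    Assumption: a stand-in for whatever curvature estimate the paper uses."""
    M = G.T @ G / G.shape[0]                 # (d, d) second-moment matrix
    evals, evecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:rank]     # indices of the largest eigenvalues
    V, lam = evecs[:, idx], evals[idx]
    scale = 1.0 / np.sqrt(lam + damping)
    return (G @ V) * scale, V, scale

def sparse_code(X, D, s):
    """Hard thresholding: keep the s largest-magnitude correlations per document."""
    C = X @ D.T                              # (n_docs, n_atoms)
    drop = np.argsort(np.abs(C), axis=1)[:, :-s]
    np.put_along_axis(C, drop, 0.0, axis=1)
    return C

def learn_atoms(X, n_atoms, s=2, n_iter=30, seed=0):
    """Alternate sparse coding with a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n_atoms, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iter):
        C = sparse_code(X, D, s)
        D = np.linalg.lstsq(C, X, rcond=None)[0]      # solves C @ D ~= X
        dead = np.linalg.norm(D, axis=1) < 1e-8       # atoms no document used
        D[dead] = rng.standard_normal((int(dead.sum()), X.shape[1]))
        D /= np.linalg.norm(D, axis=1, keepdims=True)
    return D, sparse_code(X, D, s)

def coherence(X, codes, atom, top=10):
    """Mean pairwise cosine similarity among the documents loading most on `atom`."""
    top_docs = np.argsort(np.abs(codes[:, atom]))[-top:]
    U = X[top_docs] / np.linalg.norm(X[top_docs], axis=1, keepdims=True)
    S = U @ U.T
    return (S.sum() - top) / (top * (top - 1))        # exclude the diagonal

def steer(theta, atom, V, scale, alpha):
    """Map an atom back to parameter space and add it as a weight perturbation."""
    direction = V @ (atom * scale)
    return theta + alpha * direction / np.linalg.norm(direction)

# Synthetic "per-document gradients": each document follows one hidden task type.
rng = np.random.default_rng(0)
n_docs, d, k_true = 200, 32, 4
true_atoms = rng.standard_normal((k_true, d))
labels = rng.integers(0, k_true, size=n_docs)
G = true_atoms[labels] * rng.uniform(0.5, 1.5, (n_docs, 1))
G += 0.05 * rng.standard_normal((n_docs, d))

X, V, scale = precondition(G, rank=8)
D, codes = learn_atoms(X, n_atoms=6, s=2)
scores = [coherence(X, codes, a) for a in range(D.shape[0])]
best = int(np.argmax(scores))                         # most coherent atom
theta = np.zeros(d)                                   # placeholder model weights
theta_steered = steer(theta, D[best], V, scale, alpha=0.5)
```

No labels enter `learn_atoms`; the hidden `labels` array is used only to generate the toy gradients. The sign and size of `alpha` are free parameters in this sketch, and for a real model the gradient dimension would be far too large for a dense `d x d` matrix, so a structured or low-rank curvature approximation would be needed instead.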


Related papers