Fugu-MT Paper Translation (Abstract): Characterizing Model-Native Skills

Paper summary: Characterizing Model-Native Skills

arxiv url: http://arxiv.org/abs/2604.17614v1
Date: Sun, 19 Apr 2026 20:58:25 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.602273
Title: Characterizing Model-Native Skills
Title (reference translation): Characterization of Model-Native Skills
Authors: Feiyang Kang, Mahavir Dabas, Myeongseob Ko, Ruoxi Jia
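Abstract summary: Skills are a natural unit for describing what a language model can do and how its behavior can be changed. Existing characterizations rely on human-written taxonomies, textual descriptions, and manual profiling pipelines. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*.
Reference score (independently computed attention level): 16.891026204025838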
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines--all external hypotheses about what matters that need not align with the model's internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH--an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Codes are open-sourced.
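Abstract (reference translation): Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, and manual profiling pipelines. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology. We validate this characterization on reasoning post-training, using both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify the directions most useful for a given model. Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. The code is open source.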
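
Illustrative sketch (not from the paper): the abstract does not say how the orthogonal basis is recovered, so the example below assumes plain PCA via SVD as a stand-in. It uses synthetic activations, and the function names `recover_basis`, `select_along`, and `steer` are hypothetical. It only shows why directions that live in activation space can serve both for ranking data and as steering vectors.

```python
import numpy as np

def recover_basis(activations, k):
    """Return k orthonormal directions (rows) via SVD of mean-centered activations.

    Assumption: PCA stands in for the paper's unspecified recovery method.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def select_along(activations, direction, n):
    """Indices of the n examples with the largest projection onto `direction`."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    scores = centered @ direction
    return np.argsort(scores)[::-1][:n]

def steer(hidden, direction, alpha):
    """Shift hidden states by alpha * direction (broadcasts over a batch)."""
    return hidden + alpha * direction

# Synthetic stand-in for sequence-level activations: 200 sequences, 16 dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))

basis = recover_basis(acts, k=4)                # 4 orthonormal directions
top = select_along(acts, basis[0], n=10)        # data selection along one direction
steered = steer(acts[:3], basis[0], alpha=2.0)  # same direction used for steering
```

In a real setting `acts` would come from a model's hidden states, and the choice of direction and of `alpha` would need validation, for instance with the proxy interventions the abstract mentions.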

Related papers