Fugu-MT Paper Translation (Abstract): Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

Paper overview: Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

arxiv url: http://arxiv.org/abs/2605.28664v1
Date: Wed, 27 May 2026 15:59:45 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:56.196173
Title: Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection
Title (reference translation): Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection
Authors: Vijeta Deshpande, Tootiya Giyahchi, Veena Padmanabhan, Leman Akoglu, Anna Rumshisky
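Abstract summary: Activation Steering (AS) has emerged as a data-efficient method for generating responses aligned with a target concept. This work is a two-fold study with intrinsic and extrinsic evaluation across 4 concepts. Increasing steering strength reduces response diversity.
Reference score (independently computed attention measure): 18.555524134112755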
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization; however, such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data results in a better classifier than the prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.
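Abstract (reference translation): Safety detection models need examples of outputs that violate HHH (Helpful, Harmless, Honest) for robust generalization, but such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating responses aligned with a target concept. We examine whether AS can produce high-quality training datasets for downstream classifiers. We conduct a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, going beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample-level and set-level diversity as a quality axis previously absent from the literature, and show that increasing steering strength reduces response diversity. Extrinsically, we replace the HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data yields a better classifier than prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, which indicates that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, giving practitioners a practical heuristic target when tuning AS hyperparameters. Our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.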
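
The abstract names the harmonic mean of steering success, coherence, and diversity as a heuristic tuning target. The following is a minimal sketch of that calculation, not the authors' implementation. It assumes each axis has already been scored and normalized to the range [0, 1]; the abstract does not say how the scores are computed or scaled. The function name `steering_quality` and the example values are hypothetical.

```python
def steering_quality(success: float, coherence: float, diversity: float) -> float:
    """Harmonic mean of three quality scores.

    Assumption (not stated in the abstract): every score lies in [0, 1].
    """
    scores = (success, coherence, diversity)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each score is assumed to lie in [0, 1]")
    # The harmonic mean is zero whenever any single axis is zero.
    if any(s == 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical scores for illustration only; these are not results from the paper.
balanced = steering_quality(0.5, 0.5, 0.5)
low_diversity = steering_quality(0.9, 0.9, 0.1)
arithmetic_low_diversity = (0.9 + 0.9 + 0.1) / 3
```

Unlike the arithmetic mean, the harmonic mean is pulled toward the weakest axis, so a configuration with strong success and coherence but collapsed diversity still scores low. That matches the abstract's point that useful configurations must satisfy all three criteria jointly.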

Related papers