Fugu-MT Paper Translation (Abstract): Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Paper overview: Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

arxiv url: http://arxiv.org/abs/2605.21006v1
Date: Wed, 20 May 2026 10:43:17 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.629479
Title: Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Title (reference translation): Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Authors: Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary
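Abstract summary: This study examines how different personas affect sycophancy. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. The study evaluates whether off-the-shelf persona steering vectors, originally developed for role-playing and not trained on sycophancy data, can serve as an alternative.
Reference score (independently computed attention measure): 5.645350862501389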
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% and 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror-image increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
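Abstract (reference translation): This paper examines how different personas affect sycophancy. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. The study evaluates whether off-the-shelf persona steering vectors, originally developed for role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% and 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is asymmetric: steering toward agreeable personas does not produce a mirror-image increase in sycophancy. Geometrically, the persona vector is largely independent of the sycophancy direction in activation space. These results suggest that sycophancy is better understood as a persona-level property than as a single direction. https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/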
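
The abstract mentions two operations that a short sketch may make more concrete: deriving a steering direction from labelled pairs, and checking how far a persona vector is aligned with that direction. The Python below is an illustration only and is not the authors' released code. It assumes the commonly described mean-difference form of CAA, represents activations as plain NumPy arrays, and uses hypothetical helper names (caa_direction, cosine_similarity, steer) and synthetic random data in place of real model activations.

```python
import numpy as np


def caa_direction(sycophantic_acts, honest_acts):
    """Mean difference of paired activations (assumed CAA formulation).

    Both inputs have shape (n_pairs, hidden_dim); row i of each array is
    assumed to come from the same prompt.
    """
    sycophantic_acts = np.asarray(sycophantic_acts, dtype=float)
    honest_acts = np.asarray(honest_acts, dtype=float)
    if sycophantic_acts.shape != honest_acts.shape:
        raise ValueError("paired activations must have the same shape")
    return (sycophantic_acts - honest_acts).mean(axis=0)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 0 mean near-orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)


def steer(hidden, direction, coefficient):
    """Add a scaled steering vector to a hidden state.

    A negative coefficient pushes away from the direction.
    """
    return np.asarray(hidden, dtype=float) + coefficient * np.asarray(direction, dtype=float)


if __name__ == "__main__":
    # Synthetic stand-ins for real activations (hypothetical data).
    rng = np.random.default_rng(0)
    hidden_dim = 64
    honest = rng.normal(size=(32, hidden_dim))
    sycophantic = honest + rng.normal(loc=0.5, size=(32, hidden_dim))

    sycophancy_vec = caa_direction(sycophantic, honest)
    persona_vec = rng.normal(size=hidden_dim)  # placeholder for a persona vector

    print("cosine(persona, sycophancy):", cosine_similarity(persona_vec, sycophancy_vec))
    steered = steer(honest[0], sycophancy_vec, coefficient=-1.0)
    print("steered hidden state shape:", steered.shape)
```

In this sketch, a cosine similarity close to zero would correspond to the abstract's description of the persona vector as largely independent of the sycophancy direction. The layer, steering coefficient, and models actually used are not given in the abstract and are left unspecified here.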
