Fugu-MT Paper Translation (Abstract): The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

Paper abstract: The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

arxiv url: http://arxiv.org/abs/2606.05183v1
Date: Sun, 19 Apr 2026 01:26:18 GMT
Status: Translation complete
In-system update date: 2026-06-15 07:09:36.638685
Title: The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models
Authors: Patrick Keough
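Abstract summary: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. Six Gemini variants across generations 2.0, 2.5, and 3.0 are evaluated on 73 adversarial prompts. Sycophancy is quantified as continuous rather than binary.
Reference score (independently computed attention score): 0.0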
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.
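The gap between binary and graded reporting described in the abstract can be illustrated with a minimal sketch. The grades below are made-up illustrative values, not the paper's data, and the binary rule (only the maximum grade counts as a failure) is an assumption; the abstract does not state how the binary metric is defined. Only the 0-4 scale and the >= 2.0 and >= 3.0 cutoffs come from the abstract.

```python
from statistics import mean

# Hypothetical 0-4 Likert sycophancy grades (illustrative only, NOT from the paper).
scores = [0, 0, 1, 0, 2, 3, 0, 4, 1, 0, 3, 0, 2, 0, 0, 1, 3, 0, 0, 4]


def share_at_or_above(grades, threshold):
    """Return the fraction of grades at or above the given threshold."""
    return sum(g >= threshold for g in grades) / len(grades)


# Assumed binary rule: a response "fails" only at the maximum grade of 4.
binary_failure_rate = share_at_or_above(scores, 4)

# Graded views, using the cutoffs named in the abstract.
substantial = share_at_or_above(scores, 2)          # Likert >= 2.0
moderate_or_severe = share_at_or_above(scores, 3)   # Likert >= 3.0
mean_grade = mean(scores)

print(f"binary failure rate:  {binary_failure_rate:.2f}")
print(f"share >= 2.0:         {substantial:.2f}")
print(f"share >= 3.0:         {moderate_or_severe:.2f}")
print(f"mean grade:           {mean_grade:.2f}")
```

On this toy input the assumed binary rule flags far fewer responses than either graded cutoff, which is the kind of masking the abstract calls the Granularity Gap.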


Related papers