Fugu-MT Paper Translation (Abstract): From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

Paper overview: From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

arxiv url: http://arxiv.org/abs/2605.14912v1
Date: Thu, 14 May 2026 14:47:06 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.887772
Title: From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Title (reference translation): From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Authors: Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka
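Abstract summary: We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. We formalise a metric, the Pluralistic Repair Score (PRS), that distinguishes principled revision from capitulation.
Reference score (independently computed attention score): 8.459329029609602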
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
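Abstract (reference translation): Pluralistic alignment is usually operationalised as preference aggregation: generating responses that span diverse human values (Overton), steer toward them (Steerable), or represent them proportionally (Distributional). We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems mediate consequential deliberation on health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not under user pressure). On two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100), we show that, for both, agreement-following coexists with low repair quality on contested-value prompts. PRS, the Pluralistic Repair Score, measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer (interfaces, preference-data pipelines, and audit infrastructure).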

Related papers