Fugu-MT paper translation (abstract): Measuring Opinion Bias and Sycophancy via LLM-based Persuasion

Paper overview: Measuring Opinion Bias and Sycophancy via LLM-based Persuasion

arxiv url: http://arxiv.org/abs/2604.21564v2
Date: Thu, 30 Apr 2026 16:26:35 GMT
Status: translation complete
Last updated in system: 2026-05-01 14:06:12.54848
Title: Measuring Opinion Bias and Sycophancy via LLM-based Persuasion
Title (reference translation): Measuring Opinion Bias and Sycophancy via LLM-Based Persuasion
Authors: Rodrigo Nogueira, Giovana Kerche Bonás, Thales Sales Almeida, Andrea Roque, Ramon Pires, Hugo Abonizio, Thiago Laitz, Celio Larcher, Roseval Malaquias Junior, Marcos Piau,
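Abstract summary: The paper proposes a method for detecting the opinions an assistant holds on contested topics. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion; it engages the model in argumentative debate and lets bias leak through how the model concedes, resists, or counter-argues.
Reference score (attention score computed independently by the site): 8.399156116912904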
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.
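Abstract (reference translation): Large language models increasingly shape the information people consume. They are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. The paper proposes a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion; it engages the model in argumentative debate and lets bias leak through how the model concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method yields three findings: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained argument; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.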
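To make the persona comparison concrete, here is a minimal Python sketch of how per-persona judge verdicts might be collapsed into a behavioral label. It is not the llm-bias-bench implementation. The abstract does not enumerate the nine categories, so this sketch uses three coarse labels instead, and the names `PersonaVerdicts`, `classify`, and the stance strings are all hypothetical. It assumes each verdict records the model's stance toward a claim, and that the "agree" and "disagree" personas are simulated users who argue for and against that claim.

```python
from dataclasses import dataclass

# Assumed stance vocabulary for the judge's verdicts (not taken from the paper).
STANCES = {"agree", "disagree", "neutral"}


@dataclass(frozen=True)
class PersonaVerdicts:
    """Model stance toward one claim, as judged under each simulated user persona."""
    neutral_user: str   # stance when the user takes no side
    agree_user: str     # stance when the user argues for the claim
    disagree_user: str  # stance when the user argues against the claim


def classify(v: PersonaVerdicts) -> str:
    """Collapse three persona verdicts into a coarse behavioral label.

    Simplified three-way stand-in for the paper's nine-way classification.
    """
    stances = (v.neutral_user, v.agree_user, v.disagree_user)
    for s in stances:
        if s not in STANCES:
            raise ValueError(f"unknown stance: {s!r}")

    # Same stance regardless of who is asking: a persona-independent position.
    if len(set(stances)) == 1:
        return f"persona-independent:{stances[0]}"

    # Stance follows the user on both sides: persona-dependent mirroring.
    if v.agree_user == "agree" and v.disagree_user == "disagree":
        return "sycophantic"

    # Anything else (partial shifts, contrarian behavior) is left unresolved here.
    return "mixed"
```

For example, `classify(PersonaVerdicts("neutral", "agree", "disagree"))` yields the mirroring label, while three identical verdicts yield a persona-independent one. A fuller version would split the "mixed" bucket into finer categories.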

Related papers