Fugu-MT Paper Translation (Abstract): Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Paper abstract: Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

arxiv url: http://arxiv.org/abs/2606.01456v1
Date: Sun, 31 May 2026 21:30:43 GMT
Status: Translation complete
System update date: 2026-06-02 21:34:29.712155
Title: Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
Title (reference translation): Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
Authors: Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou
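Abstract summary: Large language models are increasingly deployed as advisors whose objectives are not aligned with the user's. We turn the canonical Crawford-Sobel cheap-talk model into a benchmark for honesty under preference misalignment. All four models tested over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x.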
Reference score (independently computed attention score): 0.8699280339422538
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
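Abstract (reference translation): Large language models are increasingly deployed as advisors whose goals do not match the user's: recommenders optimize for engagement, sales assistants for purchases, and negotiation agents for concessions. Whether such advisors remain truthful when honesty conflicts with their own payoff is a core question in alignment evaluation. We convert the standard Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence; it predicts coarse monotone partitions, with fewer informative intervals as the conflict of preferences grows. The sender observes a state omega in [0,1], wants the receiver's action to be near omega+b, and sends a single costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell, for 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12}, the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find that all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness falls with bias as predicted but never approaches the strategic optimum. Instead of coarse partitions, the models show near-full revelation with a constant upward offset that tracks their bias (linear exaggeration). Payoff-maximizing versus honesty framing has a negligible effect. A decoder ablation shows that the finding is recoverable only when the receiver reads the number stated by the sender: an embedding-only decoder misreads the same data as near-babbling.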
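
The partition sizes quoted in the abstract (7, 4, 3, 2 for b = 0.01, 0.04, 0.08, 0.12) can be checked with a minimal sketch. It assumes the textbook uniform-quadratic Crawford-Sobel setting (state uniform on [0,1], quadratic losses), which the abstract does not spell out. In that setting an N-interval equilibrium exists when 2N(N-1)b < 1, and the cut points are a_i = i/N + 2bi(i-N). The function names are hypothetical, and the sketch does not attempt to reproduce the oracle normalized mutual information values, since the abstract does not state the normalization.

```python
# Sketch under an assumption: uniform prior on [0,1] and quadratic losses
# (the textbook Crawford-Sobel example). Function names are illustrative only.

def max_partition_size(b):
    """Largest N such that an N-interval equilibrium exists: 2*N*(N-1)*b < 1."""
    if b <= 0:
        raise ValueError("bias must be positive")
    n = 1
    # Keep growing the partition while the next size still satisfies the bound.
    while 2 * (n + 1) * n * b < 1:
        n += 1
    return n


def boundaries(b, n):
    """Cut points a_0..a_n of the n-interval equilibrium: a_i = i/n + 2*b*i*(i - n)."""
    return [i / n + 2 * b * i * (i - n) for i in range(n + 1)]


biases = [0.01, 0.04, 0.08, 0.12]
sizes = [max_partition_size(b) for b in biases]

for b, n in zip(biases, sizes):
    cuts = [round(a, 4) for a in boundaries(b, n)]
    print(f"b={b:.2f}  N={n}  cuts={cuts}")
```

Because the sender wants the action shifted upward, the intervals widen toward the top of the state space: each interval is 4b longer than the one below it, so high states are pooled more coarsely than low ones.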
