Fugu-MT Paper Translation (Abstract): Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Paper abstract: Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

arxiv url: http://arxiv.org/abs/2606.12730v1
Date: Wed, 10 Jun 2026 22:28:53 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.49139
Title: Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
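Title (reference translation): Rethinking the Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Authors: Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Abstract summary: We ran experiments across four behavioral tasks and 11 frontier LLMs. SR-behavior coherence exists but is selective. These results suggest that coarse personality frameworks may not be the best tools for testing deployment behavior.
Reference score (independently computed attention score): 58.538864650160406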
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.
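Abstract (reference translation): Predicting the behavioral tendencies of LLMs from low-cost psychometric probes is important for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work has documented SR-behavior dissociation in LLMs, but it relied on broad personality traits (Big 5) that predict specific behaviors only weakly, even in humans. In addition, the isolation of conversational sessions and weak context matching left unresolved whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB). We run experiments across four behavioral tasks and 11 frontier LLMs, also varying session context and identity induction. SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context. 3) Persona prompting makes self-reports more consistent across conversations but does not bring behavior into alignment. These results suggest that coarse personality frameworks such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.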

Related papers