Fugu-MT Paper Translation (Abstract): SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Paper overview: SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

arxiv url: http://arxiv.org/abs/2510.17516v1
Date: Mon, 20 Oct 2025 13:14:38 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.460433
Title: SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Authors: Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Dirk Hovy, Nigel Collier, Paul Röttger
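Abstract summary: We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. Even the best LLMs today have limited simulation ability (score: 40.80/100), and performance scales log-linearly with model size. We show that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
Reference score (independently computed attention measure): 58.87134689752605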
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
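The abstract's distinction between low-entropy (consensus) and high-entropy (diverse) questions can be made concrete with a minimal sketch. It assumes Shannon entropy over the shares of human respondents choosing each answer option, normalized by its maximum. The abstract does not state the exact entropy definition or any thresholds SimBench uses, and the response shares below are hypothetical, not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum, log2(number of options); lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    return shannon_entropy(probs) / math.log2(len(probs))


# Hypothetical response shares over four answer options (illustrative only).
consensus = [0.91, 0.05, 0.03, 0.01]  # most respondents agree: low entropy
diverse = [0.25, 0.25, 0.25, 0.25]    # respondents split evenly: maximal entropy

print(round(normalized_entropy(consensus), 3))
print(round(normalized_entropy(diverse), 3))
```

Under this reading, a question near 0 is one where a single answer dominates, and a question near 1 is one where human responses are spread evenly across the options.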

Related papers