Fugu-MT paper translation (abstract): HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Paper overview: HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

arxiv url: http://arxiv.org/abs/2602.00685v1
Date: Sat, 31 Jan 2026 12:07:42 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.336473
Title: HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
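Title (reference translation): HumanStudy-Bench: Aiming at AI Agent Design for Participant Simulation
Authors: Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin
Abstract summary: Large language models (LLMs) are increasingly used as simulated participants in social science experiments. HUMANSTUDY-BENCH is a benchmark and execution engine that orchestrates LLM-based agents to reconstruct human-subject experiments. To evaluate fidelity at the level of scientific inference, the authors propose new metrics that quantify how much human and agent behaviors agree.
Reference score (attention score computed independently by this site): 11.906370453952265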
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter--Extract--Execute--Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.
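Abstract (reference translation): Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations often conflate base-model capabilities with the experimental instantiation, which obscures whether outcomes reflect the model itself or the agent setup. The authors instead treat participant simulation as an agent-design problem over full experimental protocols, in which an agent is defined by a base model and a specification (for example, participant attributes) that encodes behavioral assumptions. HUMANSTUDY-BENCH is a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments through a Filter-Extract-Execute-Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, the authors propose new metrics that quantify how much human and agent behaviors agree. In this dynamic benchmark, 12 foundational studies are instantiated as an initial suite, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.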
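
The abstract describes an agent as a base model plus a specification, and a four-stage Filter, Extract, Execute, Evaluate pipeline. The Python sketch below is only an illustration of that shape, not the benchmark's actual code or API. Every name in it (`AgentSpec`, `Study`, `stub_respond`, the toy studies and numbers) is hypothetical, the "analysis" is a stand-in difference of means, and the sign-agreement check is a placeholder rather than the metric proposed in the paper.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Tuple

Trial = Tuple[str, str]  # (condition, prompt); a hypothetical trial format


@dataclass(frozen=True)
class AgentSpec:
    """Agent = base model + specification (field names are assumptions)."""
    base_model: str
    participant_attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Study:
    """Toy stand-in for a published human-subject study."""
    name: str
    has_full_protocol: bool
    trials: List[Trial]
    human_effect: float  # toy number, not taken from any real study


def filter_studies(studies: List[Study]) -> List[Study]:
    # Filter: keep only studies whose protocol can be replayed.
    return [s for s in studies if s.has_full_protocol]


def extract_trials(study: Study) -> List[Trial]:
    # Extract: recover the trial sequence in its original order.
    return list(study.trials)


def execute(agent: AgentSpec, trials: List[Trial],
            respond: Callable[[AgentSpec, Trial], float]) -> List[float]:
    # Execute: replay every trial against the agent.
    return [respond(agent, trial) for trial in trials]


def analyze(trials: List[Trial], responses: List[float]) -> float:
    # Stand-in analysis: mean(treatment) - mean(control).
    treated = [r for (cond, _), r in zip(trials, responses) if cond == "treatment"]
    control = [r for (cond, _), r in zip(trials, responses) if cond == "control"]
    return mean(treated) - mean(control)


def evaluate(human_effect: float, agent_effect: float) -> bool:
    # Placeholder agreement check: do the effects point the same way?
    return (human_effect > 0) == (agent_effect > 0)


def run_pipeline(studies: List[Study], agent: AgentSpec,
                 respond: Callable[[AgentSpec, Trial], float]) -> Dict[str, bool]:
    results = {}
    for study in filter_studies(studies):
        trials = extract_trials(study)
        responses = execute(agent, trials, respond)
        results[study.name] = evaluate(study.human_effect, analyze(trials, responses))
    return results


def stub_respond(agent: AgentSpec, trial: Trial) -> float:
    # Deterministic stub in place of a real LLM call.
    condition, _prompt = trial
    return 0.8 if condition == "treatment" else 0.3


toy_trials = [("treatment", "prompt A"), ("control", "prompt B"),
              ("treatment", "prompt C"), ("control", "prompt D")]
studies = [Study("toy_study_a", True, toy_trials, human_effect=0.4),
           Study("toy_study_b", False, toy_trials, human_effect=-0.2)]
agent = AgentSpec("hypothetical-base-model", {"age_group": "18-25"})
results = run_pipeline(studies, agent, stub_respond)
print(results)
```

Keeping the agent specification separate from the response function mirrors the abstract's point that outcomes should be attributable either to the base model or to the agent setup, so either one can be swapped while the replayed trials and the analysis stay fixed.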
