Fugu-MT Paper Translation (Abstract): NARRA-Gym for Evaluating Interactive Narrative Agents

Paper overview: NARRA-Gym for Evaluating Interactive Narrative Agents

arxiv url: http://arxiv.org/abs/2605.08503v1
Date: Fri, 08 May 2026 21:36:23 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.69051
Title: NARRA-Gym for Evaluating Interactive Narrative Agents
Authors: Yue Huang, Yuchen Ma, Jiayi Ye, Wenjie Wang, Zipeng Ling, Xingjian Hu, Yuexing Hao, Zichen Chen, Zhangchen Xu, Yunhong He, Zhengqing Yuan, Yujun Zhou, Kehan Guo, Chaoran Chen, Toby Jia-Jun Li, Stefan Feuerriegel, Xiangliang Zhang
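Abstract summary: NARRA-Gym is an evaluation environment that turns a sparse emotional seed into a complete interactive story episode. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs.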
Reference score (independently computed attention score): 69.49891044929372
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.

Related papers