Fugu-MT Paper Translation (Abstract): EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

Paper summary: EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

arxiv url: http://arxiv.org/abs/2605.07247v1
Date: Fri, 08 May 2026 05:08:16 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.808875
Title: EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
Authors: Yi Liu, TingFeng Hui, Wei Zhang, Li Sun, Ningxin Su, Jian Wang, Sen Su
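Abstract summary: A promising direction is to replace manually crafted environments with LLM-simulated counterparts. This paradigm, however, hinges on the unexamined assumption that LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures.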
Reference score (independently computed attention score): 16.36266898493489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
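The "state change cliff" described in the abstract (high accuracy when the state is unchanged, failure when several states must update together) can be illustrated with a small stratified-accuracy sketch. Everything here is an assumption made for illustration: the sample schema (`state`, `expected`, `predicted`), the helper names, and the toy data are hypothetical and are not taken from the EnvSimBench code or dataset.

```python
from collections import defaultdict


def changed_keys(before: dict, after: dict) -> set:
    """Return the keys whose values differ between two state snapshots."""
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def accuracy_by_change_count(samples: list) -> dict:
    """Group samples by how many state fields the ground truth changes,
    then compute exact-match accuracy of the predicted next state per group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for s in samples:
        n_changes = len(changed_keys(s["state"], s["expected"]))
        totals[n_changes] += 1
        correct[n_changes] += int(s["predicted"] == s["expected"])
    return {n: correct[n] / totals[n] for n in sorted(totals)}


# Toy, made-up samples (hypothetical schema, not benchmark data).
samples = [
    # No state change required; the simulated state matches.
    {"state": {"cart": 0, "stock": 5},
     "expected": {"cart": 0, "stock": 5},
     "predicted": {"cart": 0, "stock": 5}},
    # One field must change; the simulated state matches.
    {"state": {"door": "closed"},
     "expected": {"door": "open"},
     "predicted": {"door": "open"}},
    # Two fields must change together; the simulator forgets to update "stock".
    {"state": {"cart": 0, "stock": 5},
     "expected": {"cart": 1, "stock": 4},
     "predicted": {"cart": 1, "stock": 5}},
]

result = accuracy_by_change_count(samples)
print(result)
```

Bucketing by the number of changed fields is only one possible stratification; the abstract mentions three difficulty axes but does not name them, so this sketch should not be read as the benchmark's actual metric.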

Related papers