Fugu-MT Paper Translation (Abstract): RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Paper overview: RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arxiv url: http://arxiv.org/abs/2606.15862v3
Date: Fri, 19 Jun 2026 10:48:19 GMT
Status: Translation complete
Last updated in system: 2026-06-23 13:41:30.821614
Title: RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments
Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang
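Abstract summary: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. RetailBench is a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation.
Reference score (attention score computed independently by this site): 8.751899157366005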
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.
