Fugu-MT Paper Translation (Summary): RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Paper summary: RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

arxiv url: http://arxiv.org/abs/2603.16453v1
Date: Tue, 17 Mar 2026 12:35:52 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.276146
Title: RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
Title (reference translation): RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments
Authors: Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang,
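Abstract summary: Large Language Model (LLM)-based agents have achieved notable success on short-horizon, highly structured tasks. RetailBench is a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios. The paper proposes the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution.
Reference score (independently computed attention score): 8.751899157366005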
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.
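Abstract (reference translation): Large Language Model (LLM)-based agents have achieved notable success on short-horizon, highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic, dynamic environments remains an open challenge. RetailBench is a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. The paper further proposes the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. Non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments with eight state-of-the-art LLMs across progressively challenging environments show that the framework improves operational stability and efficiency compared with other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations of current LLMs in long-horizon, multi-factor decision-making.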
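
The abstract describes separating high-level strategic reasoning from low-level action execution, with strategies revised on a different temporal scale than actions. The following is a minimal toy sketch of that two-timescale idea only; it is not the paper's implementation and does not use RetailBench. Every name here (`Strategy`, `revise_strategy`, `execute_step`, `run`), the restocking rule, the demand model, and the 7-day revision interval are hypothetical assumptions chosen for illustration, and a simple averaging rule stands in for the LLM-driven reasoning.

```python
import random
from dataclasses import dataclass


@dataclass
class Strategy:
    """High-level plan: the stock level the executor tries to maintain."""
    target_stock: int


def revise_strategy(strategy: Strategy, recent_demand: list[int]) -> Strategy:
    """Slow loop: reset the target to twice the recent average demand."""
    if not recent_demand:
        return strategy
    avg = sum(recent_demand) / len(recent_demand)
    return Strategy(target_stock=max(1, round(2 * avg)))


def execute_step(strategy: Strategy, stock: int) -> int:
    """Fast loop: order enough units to reach the current target."""
    return max(0, strategy.target_stock - stock)


def run(horizon: int = 30, revise_every: int = 7, seed: int = 0) -> dict:
    rng = random.Random(seed)
    strategy = Strategy(target_stock=10)
    stock, sold, lost, revisions = 0, 0, 0, 0
    history: list[int] = []
    for day in range(horizon):
        # The strategy is revised on a coarser time scale than actions.
        if day > 0 and day % revise_every == 0:
            strategy = revise_strategy(strategy, history[-revise_every:])
            revisions += 1
        # An action is taken every day under the current strategy.
        stock += execute_step(strategy, stock)
        # Stochastic, slowly drifting demand (a stand-in for non-stationarity).
        demand = rng.randint(3, 8) + day // 10
        served = min(stock, demand)
        stock -= served
        sold += served
        lost += demand - served
        history.append(demand)
    return {"sold": sold, "lost": lost, "revisions": revisions, "stock": stock}
```

Calling `run(horizon=30, revise_every=7)` takes an action on each of the 30 simulated days but revises the strategy only on days 7, 14, 21, and 28, which is the separation of time scales the abstract refers to.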
