Fugu-MT Paper Translation (Abstract): ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Paper abstract: ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

arxiv url: http://arxiv.org/abs/2605.16116v1
Date: Fri, 15 May 2026 16:00:38 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.358069
Title: ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Title (reference translation): ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
Authors: Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang
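Abstract summary: ShopGym is an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopArena converts live seed storefronts into self-contained sandbox shops through shop specifications and a staged, validated generation process. ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances.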
Reference score (independently calculated attention level): 12.399936351655917
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.
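Abstract (reference translation): Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, whereas hand-built sandbox benchmarks provide control but cover only a limited range of layouts, catalogs, policies, and interaction patterns. The field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. This paper introduces ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. The framework is validated through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops, three constructed with synthetic data and three with real data. The results show that the synthetic shops preserve key structural properties of live storefronts, and that agent performance on synthetic shops is positively correlated with performance on live storefronts.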
