EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
- URL: http://arxiv.org/abs/2512.08868v2
- Date: Thu, 11 Dec 2025 16:38:57 GMT
- Title: EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
- Authors: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Bo Zhang, Xuan Zhou, Ming Yan, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie
- Abstract summary: Foundation agents have rapidly advanced in their ability to reason and interact with real environments. EcomBench is a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
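The abstract describes tasks grouped into task categories and three difficulty levels. As a rough illustration only, a minimal evaluation harness for such a benchmark might look like the sketch below; the task schema, difficulty labels, and exact-match scoring are hypothetical assumptions, not details from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    question: str
    answer: str
    difficulty: str  # e.g. "easy" | "medium" | "hard" -- hypothetical labels

def evaluate(agent: Callable[[str], str], tasks: List[Task]) -> Dict[str, float]:
    """Return per-difficulty accuracy using simple exact-match scoring."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for t in tasks:
        total[t.difficulty] = total.get(t.difficulty, 0) + 1
        if agent(t.question).strip() == t.answer.strip():
            correct[t.difficulty] = correct.get(t.difficulty, 0) + 1
    return {d: correct.get(d, 0) / n for d, n in total.items()}
```

Real agentic benchmarks typically replace exact match with rubric- or outcome-based grading, but the per-difficulty breakdown shown here mirrors how difficulty-tiered results are usually reported.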
Related papers
- EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce [26.028479108472265]
Large Language Model (LLM)-based agents are increasingly deployed in e-commerce applications. EComStage is a unified benchmark for evaluating agent-capable LLMs across the comprehensive stage-wise reasoning process. We evaluate over 30 LLMs, spanning from 1B to over 200B parameters, including open-source models and closed-source APIs.
arXiv Detail & Related papers (2026-01-06T06:39:16Z)
- TongSIM: A General Platform for Simulating Intelligent Machines [59.27575233453533]
Embodied intelligence focuses on training agents within realistic simulated environments. TongSIM is a high-fidelity, general-purpose platform for training and evaluating embodied agents.
arXiv Detail & Related papers (2025-12-23T10:00:43Z)
- UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI [2.0619484032730813]
UpBench is a benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback.
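Rubric-based evaluation of the kind UpBench describes, where a job is decomposed into verifiable acceptance criteria with per-criterion feedback, can be sketched minimally as follows; the data model and aggregation rules here are illustrative assumptions, not UpBench's actual implementation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    description: str   # one verifiable acceptance criterion
    passed: bool       # reviewer's pass/fail judgment
    feedback: str = "" # per-criterion reviewer feedback

def job_score(criteria: List[Criterion]) -> float:
    """Fraction of acceptance criteria the submission satisfied."""
    return sum(c.passed for c in criteria) / len(criteria)

def job_accepted(criteria: List[Criterion]) -> bool:
    """A submission is accepted only if every criterion passes."""
    return all(c.passed for c in criteria)
```

Splitting the score (partial credit) from the acceptance decision (all-or-nothing) reflects a common design choice in rubric-based grading: the fractional score tracks progress, while acceptance mirrors whether a real client would have paid for the work.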
arXiv Detail & Related papers (2025-11-15T17:39:37Z)
- WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality [62.43165871914528]
We introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias.
arXiv Detail & Related papers (2025-10-21T12:16:04Z)
- Towards General Agentic Intelligence via Environment Scaling [78.66355092082253]
Advanced agentic intelligence is a prerequisite for deploying Large Language Models in real-world applications. We design a scalable framework that automatically constructs heterogeneous, fully simulated environments. Experiments on agentic benchmarks (tau-bench, tau2-Bench, and ACEBench) demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
arXiv Detail & Related papers (2025-09-16T17:57:20Z)
- ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues? [13.814769031037526]
We introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. Even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios.
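The pass^3 figure cited above follows the pass^k convention used in agentic benchmarks such as tau-bench: a task counts as solved only if the agent succeeds in all k of k sampled attempts, which penalizes inconsistency. Assuming n >= k recorded trials per task with c successes, an unbiased estimator can be sketched as:

```python
from math import comb
from typing import List, Tuple

def pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Estimate pass^k: the probability that k independent attempts at a
    task all succeed, averaged over tasks. Each result is (c, n), meaning
    c successful trials out of n total, with n >= k."""
    return sum(comb(c, k) / comb(n, k) for c, n in results) / len(results)
```

Note that `math.comb(c, k)` returns 0 when c < k, so a task with fewer than k successes contributes nothing, which is exactly the all-k-must-succeed semantics.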
arXiv Detail & Related papers (2025-07-08T03:35:48Z)
- AI-Driven Sentiment Analytics: Unlocking Business Value in the E-Commerce Landscape [0.0]
This paper presents an AI-driven sentiment analysis system designed specifically for e-commerce applications. Our approach integrates traditional machine learning techniques with modern deep learning models, allowing for a more nuanced understanding of customer sentiment. Experimental results show that our system outperforms standard sentiment analysis methods, achieving an accuracy of 89.7% on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-20T18:56:22Z)
- ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models [15.940958043509463]
We propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks. By carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts.
arXiv Detail & Related papers (2025-02-27T15:36:00Z)
- WebArena: A Realistic Web Environment for Building Autonomous Agents [92.3291458543633]
We build an environment for language-guided agents that is highly realistic and reproducible.
We focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains.
We release a set of benchmark tasks focusing on evaluating the functional correctness of task completions.
arXiv Detail & Related papers (2023-07-25T22:59:32Z)
- Towards Ubiquitous Semantic Metaverse: Challenges, Approaches, and Opportunities [68.03971716740823]
In recent years, ubiquitous semantic Metaverse has been studied to revolutionize immersive cyber-virtual experiences for augmented reality (AR) and virtual reality (VR) users.
This survey focuses on the representation and intelligence for the four fundamental system components in ubiquitous Metaverse.
arXiv Detail & Related papers (2023-07-13T11:14:46Z)