ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
- URL: http://arxiv.org/abs/2507.05639v1
- Date: Tue, 08 Jul 2025 03:35:48 GMT
- Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
- Authors: Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang
- Abstract summary: We introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues.
- Score: 20.83383124467603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only 10-20% on the pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.
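The abstract reports a pass^3 score without defining it. The sketch below assumes pass^k means the probability that an agent solves the same task in all k independent trials, and estimates it per task as C(c, k) / C(n, k) for c successes out of n trials, then averages over tasks. This definition and the function names `pass_hat_k` and `benchmark_pass_hat_k` are assumptions made for illustration and are not taken from the paper.

```python
from math import comb
from typing import Sequence


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Assumed definition: C(c, k) / C(n, k), i.e. the fraction of size-k
    subsets of the n observed trials that contain only successes.
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must satisfy 0 < k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    # comb returns 0 when num_successes < k, so the estimate is 0.0 then.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` holds one sequence of trial outcomes (True = solved) per task.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(len(trials), sum(trials), k) for trials in results]
    return sum(per_task) / len(per_task)
```

Under this assumed definition, a task counts toward pass^3 only when every one of its three runs succeeds, which is why the metric can be much lower than a single-run success rate.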
Related papers
- MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents [21.102931466891135]
We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce. It integrates memory, decision-making, and action modules, and adopts a modular "MLLM-as-Tool" strategy for effective visual-textual reasoning.
arXiv Detail & Related papers (2025-07-07T17:53:55Z)
- EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association [83.4879773429742]
This paper defines the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step. We construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products.
arXiv Detail & Related papers (2025-05-21T07:21:38Z)
- ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph [31.21413440242778]
Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. Although some methods have been proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap in effective assessment for e-commerce. We propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge.
arXiv Detail & Related papers (2025-03-20T09:49:15Z)
- ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models [15.940958043509463]
We propose ChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks. By carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts.
arXiv Detail & Related papers (2025-02-27T15:36:00Z)
- CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z)
- A survey on fairness of large language models in e-commerce: progress, application, and challenge [8.746342211863332]
This survey explores the fairness of large language models (LLMs) in e-commerce.
It examines their progress, applications, and the challenges they face.
The paper critically addresses the fairness challenges in e-commerce, highlighting how biases in training data and algorithms can lead to unfair outcomes.
arXiv Detail & Related papers (2024-05-15T23:25:19Z)
- On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment.
We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z)
- Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue [80.51690477289418]
Conversational recommender systems (CRSs) learn user representation and provide accurate recommendations based on dialogue context, but rely on external knowledge.
Large language models (LLMs) generate responses that mimic pre-sales dialogues after fine-tuning, but lack domain-specific knowledge for accurate recommendations.
This paper investigates the effectiveness of combining LLM and CRS in E-commerce pre-sales dialogues.
arXiv Detail & Related papers (2023-10-23T07:00:51Z)
- EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce [68.72104414369635]
We propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data.
EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks.
arXiv Detail & Related papers (2023-08-14T06:49:53Z)
- LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following [16.800545001782037]
This paper proposes LLaMA-E, unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms.
We design an instruction set derived from the tasks of ad generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A.
The proposed LLaMA-E models achieve state-of-the-art evaluation performance and show advantages in zero-shot practical applications.
arXiv Detail & Related papers (2023-08-09T12:26:37Z)
- Automatic Controllable Product Copywriting for E-Commerce [58.97059802658354]
We deploy an E-commerce Prefix-based Controllable Copywriting Generation (EPCCG) system into the JD.com e-commerce recommendation platform.
We conduct experiments to validate the effectiveness of the proposed EPCCG.
We describe the deployed architecture, which integrates EPCCG into the real-time JD.com e-commerce recommendation platform.
arXiv Detail & Related papers (2022-06-21T04:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.