Related papers: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

URL: http://arxiv.org/abs/2507.05639v1
Date: Tue, 08 Jul 2025 03:35:48 GMT
Title: ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?
Authors: Haoxin Wang, Xianhan Peng, Xucheng Huang, Yizhe Huang, Ming Gong, Chenghan Yang, Yang Liu, Ling Jiang,
Abstract summary: We introduce ECom-Bench, the first benchmark framework for evaluating LLM agents with multimodal capabilities in the e-commerce customer support domain.<n>ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues.
Score: 20.83383124467603
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce ECom-Bench, the first benchmark framework for evaluating LLM agent with multimodal capabilities in the e-commerce customer support domain. ECom-Bench features dynamic user simulation based on persona information collected from real e-commerce customer interactions and a realistic task dataset derived from authentic e-commerce dialogues. These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging. For instance, even advanced models like GPT-4o achieve only a 10-20% pass^3 metric in our benchmark, highlighting the substantial difficulties posed by complex e-commerce scenarios. Upon publication, the code and data will be open-sourced to facilitate further research and development in this domain.

Related papers

MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents [21.102931466891135]
We present MindFlow, the first open-source multimodal LLM agent tailored for e-commerce.<n>It integrates memory, decision-making, and action modules, and adopts a modular "MLLM-as-Tool" strategy for effect visual-textual reasoning.
arXiv Detail & Related papers (2025-07-07T17:53:55Z)
EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association [83.4879773429742]
This paper defines the task of E-commerce Script Planning (EcomScript) as three sequential subtasks.<n>We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step.<n>We construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products.
arXiv Detail & Related papers (2025-05-21T07:21:38Z)
ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph [31.21413440242778]
Large language models (LLMs) have demonstrated their capabilities across various NLP tasks.<n>Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service.<n>Despite some methods proposed to evaluate LLMs' factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap between effective assessment in e-commerce.<n>We propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge.
arXiv Detail & Related papers (2025-03-20T09:49:15Z)
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models [15.940958043509463]
We propose textbfChineseEcomQA, a scalable question-answering benchmark focused on fundamental e-commerce concepts.<n> Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks.<n>By carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts.
arXiv Detail & Related papers (2025-02-27T15:36:00Z)
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments.<n>We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities.<n>Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following to be deployed in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z)
A survey on fairness of large language models in e-commerce: progress, application, and challenge [8.746342211863332]
This survey explores the fairness of large language models (LLMs) in e-commerce. It examines their progress, applications, and the challenges they face. The paper critically addresses the fairness challenges in e-commerce, highlighting how biases in training data and algorithms can lead to unfair outcomes.
arXiv Detail & Related papers (2024-05-15T23:25:19Z)
On the Multi-turn Instruction Following for Conversational Web Agents [83.51251174629084]
We introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment. We propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques.
arXiv Detail & Related papers (2024-02-23T02:18:12Z)
Conversational Recommender System and Large Language Model Are Made for Each Other in E-commerce Pre-sales Dialogue [80.51690477289418]
Conversational recommender systems (CRSs) learn user representation and provide accurate recommendations based on dialogue context, but rely on external knowledge. Large language models (LLMs) generate responses that mimic pre-sales dialogues after fine-tuning, but lack domain-specific knowledge for accurate recommendations. This paper investigates the effectiveness of combining LLM and CRS in E-commerce pre-sales dialogues.
arXiv Detail & Related papers (2023-10-23T07:00:51Z)
EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce [68.72104414369635]
We propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data. EcomGPT outperforms ChatGPT in term of cross-dataset/task generalization on E-commerce tasks.
arXiv Detail & Related papers (2023-08-14T06:49:53Z)
LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following [16.800545001782037]
This paper proposes LLaMA-E, the unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms. We design the instruction set derived from tasks of ads generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit the advantage in zero-shot practical applications.
arXiv Detail & Related papers (2023-08-09T12:26:37Z)
Automatic Controllable Product Copywriting for E-Commerce [58.97059802658354]
We deploy an E-commerce Prefix-based Controllable Copywriting Generation into the JD.com e-commerce recommendation platform. We conduct experiments to validate the effectiveness of the proposed EPCCG. We introduce the deployed architecture which cooperates with the EPCCG into the real-time JD.com e-commerce recommendation platform.
arXiv Detail & Related papers (2022-06-21T04:18:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.