IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce
- URL: http://arxiv.org/abs/2406.10173v1
- Date: Fri, 14 Jun 2024 16:51:21 GMT
- Title: IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce
- Authors: Wenxuan Ding, Weiqi Wang, Sze Heng Douglas Kwok, Minghao Liu, Tianqing Fang, Jiaxin Bai, Junxian He, Yangqiu Song
- Abstract summary: In this paper, we present IntentionQA, a benchmark to evaluate LMs' comprehension of purchase intentions in E-commerce.
IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline.
Human evaluations demonstrate the high quality and low false-negative rate of our benchmark.
- Score: 50.41970803871156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enhancing Language Models' (LMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assistance in various downstream tasks. However, previous approaches that distill intentions from LMs often fail to generate meaningful and human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about the true comprehension and utilization of purchase intentions by LMs. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark to evaluate LMs' comprehension of purchase intentions in E-commerce. Specifically, LMs are tasked to infer intentions based on purchased products and utilize them to predict additional purchases. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle with certain scenarios, such as understanding products and intentions accurately, jointly reasoning with products and intentions, and more, in which they fall far behind human performances. Our code and data are publicly available at https://github.com/HKUST-KnowComp/IntentionQA.
Related papers
- MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding [45.47495643376656]
MIND is a framework that infers purchase intentions from multimodal product metadata and prioritizes human-centric ones.
Using Amazon Review data, we create a multimodal intention knowledge base containing 1,264,441 intentions.
Our obtained intentions significantly enhance large language models in two intention comprehension tasks.
arXiv Detail & Related papers (2024-06-15T17:56:09Z) - A survey on fairness of large language models in e-commerce: progress, application, and challenge [8.746342211863332]
This survey explores the fairness of large language models (LLMs) in e-commerce.
It examines their progress, applications, and the challenges they face.
The paper critically addresses the fairness challenges in e-commerce, highlighting how biases in training data and algorithms can lead to unfair outcomes.
arXiv Detail & Related papers (2024-05-15T23:25:19Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - A Usage-centric Take on Intent Understanding in E-Commerce [22.15241423379233]
We focus on predicative user intents, defined as "how a customer uses a product".
We identify two weaknesses of FolkScope, the SOTA E-Commerce Knowledge Graph, that limit its capacity to reason about user intents.
arXiv Detail & Related papers (2024-02-22T18:09:33Z) - EmoBench: Evaluating the Emotional Intelligence of Large Language Models [73.60839120040887]
EmoBench is a benchmark that draws upon established psychological theories and proposes a comprehensive definition for machine Emotional Intelligence (EI).
EmoBench includes a set of 400 hand-crafted questions in English and Chinese, which are meticulously designed to require thorough reasoning and understanding.
Our findings reveal a considerable gap between the EI of existing Large Language Models and the average human, highlighting a promising direction for future research.
arXiv Detail & Related papers (2024-02-19T11:48:09Z) - EcomGPT: Instruction-tuning Large Language Models with Chain-of-Task Tasks for E-commerce [68.72104414369635]
We propose the first e-commerce instruction dataset EcomInstruct, with a total of 2.5 million instruction data.
EcomGPT outperforms ChatGPT in terms of cross-dataset/task generalization on E-commerce tasks.
arXiv Detail & Related papers (2023-08-14T06:49:53Z) - Commonsense Knowledge Salience Evaluation with a Benchmark Dataset in E-commerce [42.726755541409545]
In e-commerce, the salience of commonsense knowledge (CSK) is beneficial for widespread applications such as product search and recommendation.
However, many existing CSK collections rank statements solely by confidence scores, and there is no information about which ones are salient from a human perspective.
In this work, we define the task of supervised salience evaluation, where given a CSK triple, the model is required to learn whether the triple is salient or not.
arXiv Detail & Related papers (2022-05-22T15:01:23Z) - Intent-based Product Collections for E-commerce using Pretrained Language Models [8.847005669899703]
We use a pretrained language model (PLM) that leverages textual attributes of web-scale products to make intent-based product collections.
Our model significantly outperforms the search-based baseline model for intent-based product matching in offline evaluations.
Online experimental results on our e-commerce platform show that the PLM-based method can construct collections of products with increased CTR, CVR, and order-diversity compared to expert-crafted collections.
arXiv Detail & Related papers (2021-10-15T17:52:42Z) - E-BERT: A Phrase and Product Knowledge Enhanced Language Model for E-commerce [63.333860695727424]
E-commerce tasks require accurate understanding of domain phrases, whereas such fine-grained phrase-level knowledge is not explicitly modeled by BERT's training objective.
To tackle the problem, we propose a unified pre-training framework, namely, E-BERT.
Specifically, to preserve phrase-level knowledge, we introduce Adaptive Hybrid Masking, which allows the model to adaptively switch from learning preliminary word knowledge to learning complex phrases.
To utilize product-level knowledge, we introduce Neighbor Product Reconstruction, which trains E-BERT to predict a product's associated neighbors with a denoising cross-attention layer.
arXiv Detail & Related papers (2020-09-07T00:15:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.