Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2410.20745v2
- Date: Thu, 31 Oct 2024 12:54:46 GMT
- Title: Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models
- Authors: Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang, Haodong Wang, Zhengyang Wang, Wenju Xu, Jingfeng Yang, Qingyu Yin, Xian Li, Priyanka Nigam, Yi Xu, Kai Chen, Qiang Yang, Meng Jiang, Bing Yin
- Abstract summary: Large Language Models (LLMs) have the potential to transform online shopping by alleviating task-specific engineering efforts.
We propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data.
Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality.
- Score: 95.34001906930152
- License:
- Abstract: Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality, and can thus comprehensively evaluate the abilities of LLMs as general shop assistants. With Shopping MMLU, we benchmark over 20 existing LLMs and uncover valuable insights about practices and prospects of building versatile LLM-based shop assistants. Shopping MMLU can be publicly accessed at https://github.com/KL4805/ShoppingMMLU. In addition, with Shopping MMLU, we host a competition in KDD Cup 2024 with over 500 participating teams. The winning solutions and the associated workshop can be accessed at our website https://amazon-kddcup24.github.io/.
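As a rough illustration of how results on such a skill-grouped, multi-task benchmark can be aggregated, the minimal sketch below macro-averages per-task accuracies within each of the four shopping skills. The JSONL layout and field names are assumptions made for illustration, not the repository's documented format; see https://github.com/KL4805/ShoppingMMLU for the actual data and official evaluation code.

```python
# Minimal sketch of per-skill score aggregation for a Shopping MMLU-style
# benchmark. The file layout and field names are assumptions for
# illustration only.
import json
from collections import defaultdict

def score_by_skill(results_path: str) -> dict[str, float]:
    """Macro-average per-task accuracy within each shopping skill."""
    # Each line is assumed to hold one task result, e.g.
    # {"task": "...", "skill": "concept_understanding", "accuracy": 0.73}
    with open(results_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    per_skill = defaultdict(list)
    for rec in records:
        per_skill[rec["skill"]].append(rec["accuracy"])
    # Macro-averaging weighs each of the 57 tasks equally within its skill.
    return {skill: sum(accs) / len(accs) for skill, accs in per_skill.items()}

if __name__ == "__main__":
    for skill, avg in sorted(score_by_skill("task_results.jsonl").items()):
        print(f"{skill}: {avg:.3f}")
```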
Related papers
- MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation [52.35744453954844]
This paper introduces MMRC, a benchmark for evaluating six core open-ended abilities of MLLMs.
Evaluations of 20 MLLMs on MMRC indicate an accuracy drop during open-ended interactions.
We propose a simple yet effective NOTE-TAKING strategy, which records key information from the conversation and reminds the model of it during its responses.
arXiv Detail & Related papers (2025-02-17T15:24:49Z)
- Humanity's Last Exam [253.45228996132735]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge.
It consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z)
- MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge [24.66666826440994]
MINTQA is a benchmark to evaluate large language models' capabilities in multi-hop reasoning.
MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge.
Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries.
arXiv Detail & Related papers (2024-12-22T14:17:12Z)
- Probing the Robustness of Theory of Mind in Large Language Models [6.7932860553262415]
We introduce a novel dataset of 68 tasks for probing ToM in LLMs.
We evaluate the ToM performance of four SotA open-source LLMs on our dataset and on the dataset introduced by Kosinski (2023).
We find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment.
arXiv Detail & Related papers (2024-10-08T18:13:27Z)
- SEQ+MD: Learning Multi-Task as a SEQuence with Multi-Distribution Data [5.069855142454979]
We propose the SEQ+MD framework, which integrates sequential learning for multi-task learning (MTL) and feature-generated region-mask for multi-distribution input.
We show a strong increase in high-value engagement, including add-to-cart and purchase, while keeping click performance neutral.
Our multi-regional learning module is "plug-and-play" and can be easily adapted to enhance other MTL applications.
arXiv Detail & Related papers (2024-08-23T20:14:27Z)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z)
- LLaSA: Large Language and E-Commerce Shopping Assistant [17.53318263751155]
We create an instruction dataset, termed EshopInstruct, comprising 65,000 samples across diverse tasks.
Through instruction tuning on this dataset, the resulting assistant, named LLaSA, demonstrates the potential to function as an omnipotent assistant.
In the Amazon KDD Cup 2024 Challenge, our proposed method, LLaSA, achieved an overall ranking of 3rd place on ShopBench.
arXiv Detail & Related papers (2024-08-04T12:10:51Z)
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark [44.840266648465054]
This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark.
With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro; a minimal sketch of how such a spread can be computed appears after this list.
Our assessments confirm that MMLU-Pro is a more discriminative benchmark that better tracks progress in the field.
arXiv Detail & Related papers (2024-06-03T17:53:00Z)
- MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate various SOTA LLMs on the MathChat benchmark and observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic, dialogue-based math dataset for LLM finetuning, focused on improving models' interaction and instruction-following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z)
- Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation [52.930183136111864]
We propose using scorable negotiation to evaluate Large Language Models (LLMs).
To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities.
We provide procedures to create new games and to increase their difficulty, yielding an evolving benchmark.
arXiv Detail & Related papers (2023-09-29T13:33:06Z)
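Returning to the MMLU-Pro entry above: the 4-5% versus 2% figures describe how far a model's score moves as the prompt style changes. The sketch below shows one way such a spread could be computed; the accuracy values are made up for illustration and do not come from either benchmark.

```python
# Minimal sketch of a prompt-sensitivity measurement: score one model under
# several prompt styles and report the spread. All numbers are hypothetical.

def prompt_sensitivity(accuracies: list[float]) -> float:
    """Spread (max - min) of a model's accuracy across prompt styles."""
    return max(accuracies) - min(accuracies)

# Hypothetical accuracies for one model under four prompt styles.
mmlu_scores = [0.71, 0.68, 0.73, 0.69]      # spread ~ 5 points
mmlu_pro_scores = [0.55, 0.56, 0.57, 0.55]  # spread ~ 2 points

print(f"MMLU sensitivity:     {prompt_sensitivity(mmlu_scores):.2%}")
print(f"MMLU-Pro sensitivity: {prompt_sensitivity(mmlu_pro_scores):.2%}")
```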