Related papers: AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping

URL: http://arxiv.org/abs/2602.12315v1
Date: Thu, 12 Feb 2026 17:25:45 GMT
Title: AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping
Authors: Sunghwan Kim, Ryang Heo, Yongsik Seo, Jinyoung Yeo, Dongha Lee
Abstract summary: We present AgenticShop, the first benchmark for evaluating agentic systems on personalized product curation in open-web environment. Our approach features realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework.
Score: 20.52047960513448
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of e-commerce has made web shopping platforms key gateways for customers navigating the vast digital marketplace. Yet this rapid expansion has led to a noisy and fragmented information environment, increasing cognitive burden as shoppers explore and purchase products online. With promising potential to alleviate this challenge, agentic systems have garnered growing attention for automating user-side tasks in web shopping. Despite significant advancements, existing benchmarks fail to comprehensively evaluate how well agentic systems can curate products in open-web settings. Specifically, they have limited coverage of shopping scenarios, focusing only on simplified single-platform lookups rather than exploratory search. Moreover, they overlook personalization in evaluation, leaving unclear whether agents can adapt to diverse user preferences in realistic shopping contexts. To address this gap, we present AgenticShop, the first benchmark for evaluating agentic systems on personalized product curation in open-web environment. Crucially, our approach features realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework. Through extensive experiments, we demonstrate that current agentic systems remain largely insufficient, emphasizing the need for user-side systems that effectively curate tailored products across the modern web.

Related papers

Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets [74.91125572848439]
We study two-sided agentic marketplaces where Assistant agents represent consumers and Service agents represent competing businesses. This environment enables us to study key market dynamics: the utility agents achieve, behavioral biases, vulnerability to manipulation, and how search mechanisms shape market outcomes. Our experiments show that frontier models can approach optimal welfare, but only under ideal search conditions. Performance degrades sharply with scale, and all models exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality.
arXiv Detail & Related papers (2025-10-27T18:35:59Z)
A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains [23.412858949638263]
Current benchmarks in the e-commerce domain face two major problems. They primarily focus on product search tasks, failing to capture the broader range of functionalities offered by real-world e-commerce platforms. We propose a new benchmark called Amazon-Bench to generate user queries that cover a broad range of tasks.
arXiv Detail & Related papers (2025-08-18T21:58:43Z)
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems [53.37728204835912]
Most existing AI systems rely on manually crafted configurations that remain static after deployment. Recent research has explored agent evolution techniques that aim to automatically enhance agent systems based on interaction data and environmental feedback. This survey aims to provide researchers and practitioners with a systematic understanding of self-evolving AI agents.
arXiv Detail & Related papers (2025-08-10T16:07:32Z)
KiseKloset: Comprehensive System For Outfit Retrieval, Recommendation, And Try-On [15.775881888811018]
We propose a novel comprehensive KiseKloset system for outfit retrieval, recommendation, and try-on. We introduce a novel transformer architecture designed to recommend complementary items from diverse categories. We employ a lightweight yet efficient virtual try-on framework capable of real-time operation, memory efficiency, and maintaining realistic outputs.
arXiv Detail & Related papers (2025-06-30T02:25:39Z)
DeepShop: A Benchmark for Deep Research Shopping Agents [70.03744154560717]
DeepShop is a benchmark designed to evaluate web agents in complex and realistic online shopping environments. We generate diverse queries across five popular online shopping domains. We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects.
arXiv Detail & Related papers (2025-06-03T13:08:17Z)
WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback [78.55946306325914]
We identify key reasoning skills essential for effective web agents. We reconstruct the agent's reasoning algorithms into chain-of-thought rationales. Our approach yields significant improvements across multiple benchmarks.
arXiv Detail & Related papers (2025-05-26T14:03:37Z)
An Illusion of Progress? Assessing the Current State of Web Agents [61.742657650092845]
We conduct a comprehensive and rigorous assessment of the current state of web agents. Results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites.
arXiv Detail & Related papers (2025-04-02T05:51:29Z)
Building a Scalable, Effective, and Steerable Search and Ranking Platform [0.13107669223114085]
Modern e-commerce platforms offer vast product selections, making it difficult for customers to find items that they like. It is key for e-commerce platforms to have near real-time scalable and adaptable personalized ranking and search systems. We present a personalized, near real-time ranking platform that is reusable across various use cases.
arXiv Detail & Related papers (2024-09-04T16:29:25Z)
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [83.19032025950986]
We study the use of large language model-based agents for interacting with software via web browsers. WorkArena is a benchmark of 33 tasks based on the widely-used ServiceNow platform. BrowserGym is an environment for the design and evaluation of such agents.
arXiv Detail & Related papers (2024-03-12T14:58:45Z)
OPAM: Online Purchasing-behavior Analysis using Machine learning [0.8121462458089141]
We present a customer purchasing behavior analysis system using supervised, unsupervised and semi-supervised learning methods. The proposed system analyzes session and user-journey level purchasing behaviors to identify customer categories/clusters.
arXiv Detail & Related papers (2021-02-02T17:29:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.