Related papers: SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce

URL: http://arxiv.org/abs/2602.01443v1
Date: Sun, 01 Feb 2026 21:23:04 GMT
Title: SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce
Authors: Alberto Castelo, Zahra Zanjani Foumani, Ailin Fan, Keat Yang Koay, Vibhor Malik, Yuanzheng Zhu, Han Li, Meysam Feghhi, Ronie Uliana, Shuang Xie, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Lingyun Wang, Zhong Wu,
Abstract summary: SimGym is a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control.
Score: 8.496158383334
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post-training, SimGym agents achieve state-of-the-art alignment with observed outcome shifts and reduce experiment cycles from weeks to under an hour, enabling rapid experimentation without exposure to real buyers.

Related papers

SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation [3.609531017498719]
We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales.
arXiv Detail & Related papers (2026-03-01T10:08:27Z)
Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing [54.456400601801704]
We introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features.
arXiv Detail & Related papers (2026-01-08T03:33:43Z)
See, Think, Act: Online Shopper Behavior Simulation with VLM Agents [58.92444959954643]
This paper investigates the integration of visual information, specifically webpage screenshots, into behavior simulation via VLMs. We employ SFT for joint action prediction and rationale generation, conditioning on the full interaction context. To further enhance reasoning capabilities, we integrate RL with a hierarchical reward structure, scaled by a difficulty-aware factor.
arXiv Detail & Related papers (2025-10-22T05:07:14Z)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents [35.8650712223701]
A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants. We present AgentA/B, a novel system that automatically simulates user interaction behaviors with real webpages. Our findings suggest AgentA/B can emulate human-like behavior patterns.
arXiv Detail & Related papers (2025-04-13T21:10:56Z)
PAARS: Persona Aligned Agentic Retail Shoppers [2.8737584376365355]
In e-commerce, behavioral data is collected for decision making, which can be costly and slow. We propose a framework that creates synthetic shopping agents by automatically mining anonymised historical shopping data. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results.
arXiv Detail & Related papers (2025-03-31T15:41:51Z)
CreAgent: Towards Long-Term Evaluation of Recommender System under Platform-Creator Information Asymmetry [55.64992650205645]
We propose CreAgent, a large language model-empowered creator simulation agent. By incorporating game theory's belief mechanism and the fast-and-slow thinking framework, CreAgent effectively simulates creator behavior. Our credibility validation experiments show that CreAgent aligns well with the behaviors between real-world platform and creator.
arXiv Detail & Related papers (2025-02-11T07:09:49Z)
Promptable Closed-loop Traffic Simulation [57.36568236100507]
ProSim is a multimodal promptable closed-loop traffic simulation framework. ProSim rolls out a traffic scenario in a closed-loop manner, modeling each agent's interaction with other traffic participants. To support research on promptable traffic simulation, we create ProSim-Instruct-520k, a multimodal prompt-scenario paired driving dataset.
arXiv Detail & Related papers (2024-09-09T17:59:15Z)
User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors. Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)
Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation. We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates user's analogical thinking in interactions with systems. We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.