SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation
- URL: http://arxiv.org/abs/2603.01024v1
- Date: Sun, 01 Mar 2026 10:08:27 GMT
- Title: SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation
- Authors: Tim Rieder, Marian Schneider, Mario Truss, Vitaly Tsaplin, Alina Rublea, Sinem Dere, Francisco Chicharro Sanz, Tobias Reiss, Mustafa Doga Dogan
- Abstract summary: We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales.
- Score: 3.609531017498719
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
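The pipeline the abstract describes — generate personas, have each agent state a preference between the two designs, then aggregate votes into an outcome with a confidence score — can be sketched as follows. This is a minimal illustration under stated assumptions, not SimAB's implementation: `persona_preference` is a stand-in for the LLM call, and the persona fields and the 80% bias heuristic are invented for the example.

```python
import random
from collections import Counter

def persona_preference(persona: dict, variant_a: str, variant_b: str) -> str:
    """Stand-in for an LLM call: returns the variant this persona prefers.
    A toy heuristic, seeded per persona for reproducibility."""
    rng = random.Random(persona["id"])
    # Hypothetical bias: price-sensitive personas lean toward the variant
    # whose description mentions a discount.
    if persona["price_sensitive"] and "discount" in variant_b.lower():
        return "B" if rng.random() < 0.8 else "A"
    return rng.choice(["A", "B"])

def simulate_ab(personas, variant_a, variant_b):
    """Poll every persona agent, then aggregate by majority vote."""
    votes = Counter(persona_preference(p, variant_a, variant_b) for p in personas)
    winner, top = votes.most_common(1)[0]
    confidence = top / len(personas)  # share of agents preferring the winner
    return winner, confidence, votes

# Hypothetical audience: 100 personas, half of them price-sensitive.
personas = [{"id": i, "price_sensitive": i % 2 == 0} for i in range(100)]
winner, confidence, votes = simulate_ab(
    personas,
    variant_a="Checkout page with standard pricing",
    variant_b="Checkout page highlighting a 10% discount",
)
print(winner, round(confidence, 2), dict(votes))
```

The confidence margin is what makes the paper's "high-confidence cases" distinction possible: runs where the vote split is near 50/50 can be flagged as unreliable rather than reported as a winner.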
Related papers
- ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders [48.83868690303791]
We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation.
arXiv Detail & Related papers (2026-02-18T23:00:21Z) - SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce [8.496158383334]
SimGym is a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control.
arXiv Detail & Related papers (2026-02-01T21:23:04Z) - Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing [54.456400601801704]
We introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features.
arXiv Detail & Related papers (2026-01-08T03:33:43Z) - Sim4IA-Bench: A User Simulation Benchmark Suite for Next Query and Utterance Prediction [18.30483927706278]
We present Sim4IA-Bench, a simulation benchmark suite for the prediction of the next queries and utterances. Our dataset comprises 160 real-world search sessions from the CORE search engine. Sim4IA-Bench provides a basis for evaluating and comparing user simulation approaches.
arXiv Detail & Related papers (2025-11-12T13:44:12Z) - Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z) - Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking [14.97060265751423]
Evaluation plays a crucial role in the development of ranking algorithms for search and recommender systems. The online environment is conducive to applying causal inference techniques. Businesses face unique challenges when it comes to effective A/B testing.
arXiv Detail & Related papers (2025-08-01T16:28:18Z) - TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z) - Metaphorical User Simulators for Evaluating Task-oriented Dialogue
Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation.
We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates a user's analogical thinking in interactions with systems.
We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
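As context for the group-testing result above: Dorfman's classic two-stage scheme tests a pooled sample from g people first, and retests the g individuals only when the pool is positive. Assuming noiseless tests and independent infections with prevalence p, the expected number of tests per person is 1/g + 1 - (1-p)^g, which the sketch below minimizes over g. This illustrates the noiseless baseline the paper extends to noisy settings, not the paper's Bayesian sequential algorithm.

```python
def expected_tests_per_person(g: int, p: float) -> float:
    """Dorfman two-stage pooling: one pooled test per group of g people,
    plus g individual retests when the pool is positive, which happens
    with probability 1 - (1-p)^g. Assumes noiseless tests and
    independent infections."""
    return 1.0 / g + 1.0 - (1.0 - p) ** g

p = 0.01  # 1% prevalence
best_g = min(range(2, 100), key=lambda g: expected_tests_per_person(g, p))
cost = expected_tests_per_person(best_g, p)
print(best_g, round(cost, 3))  # roughly a 5x saving over individual testing
```

At 1% prevalence the optimal pool size is around g = 11, needing about 0.2 tests per person, which is the kind of efficiency gain that motivates extending the scheme to noisy, adaptive settings.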
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or the accuracy of the information shown, and is not responsible for any consequences of its use.