Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing
- URL: http://arxiv.org/abs/2601.04554v1
- Date: Thu, 08 Jan 2026 03:33:43 GMT
- Title: Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing
- Authors: Wenlin Zhang, Xiangyang Li, Qiyuan Ge, Kuicai Dong, Pengyue Jia, Xiaopeng Li, Zijian Zhang, Maolin Wang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
- Abstract summary: We introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features.
- Score: 54.456400601801704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. Given the powerful capabilities of Large Language Models (LLMs), LLM-based agents show great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. The designed agent leverages multimodal information perception and fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. Our code is publicly available at https://github.com/Applied-Machine-Learning-Lab/ABAgent.
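The abstract describes a simulated user built from a preference profile, an action memory, and a fatigue system that together drive page-by-page decisions. The paper does not publish this interface here, so the following is a minimal hypothetical sketch of how such an agent loop could be wired up; all class names, fields, and thresholds (`SimulatedUser`, `fatigue_limit`, the memory-damping factor) are illustrative assumptions, not the authors' actual implementation.

```python
import random


class SimulatedUser:
    """Illustrative user agent for simulated A/B testing.

    A profile drives click preferences, an action memory dampens interest
    in repeatedly seen categories, and a fatigue counter ends the session,
    loosely mirroring the components named in the abstract.
    """

    def __init__(self, profile, fatigue_limit=5, seed=0):
        self.profile = profile            # e.g. {"sports": 0.9, "news": 0.2}
        self.memory = []                  # past (item, action) pairs
        self.fatigue = 0
        self.fatigue_limit = fatigue_limit
        self.rng = random.Random(seed)

    def score(self, item):
        # Base preference from the profile, dampened for categories
        # the agent has already been shown (memory retrieval).
        base = self.profile.get(item["category"], 0.1)
        seen = sum(1 for it, _ in self.memory
                   if it["category"] == item["category"])
        return base / (1 + 0.5 * seen)

    def step(self, page):
        """Consume one page of recommendations; return the clicked items."""
        clicks = []
        for item in page:
            if self.rng.random() < self.score(item):
                clicks.append(item)
                self.memory.append((item, "click"))
            else:
                self.memory.append((item, "skip"))
            self.fatigue += 1            # every impression adds fatigue
        return clicks

    def active(self):
        return self.fatigue < self.fatigue_limit


def run_session(agent, pages):
    """Feed pages until the agent fatigues; return total clicks (a CTR proxy)."""
    total = 0
    for page in pages:
        if not agent.active():
            break
        total += len(agent.step(page))
    return total
```

Running `run_session` with the same seeded agent population against two different ranking policies would stand in for the treatment/control comparison of a conventional A/B test.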
Related papers
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress-testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z) - PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation [9.841963696576546]
Personality-driven User Behaviour Simulator (PUB) integrates the Big Five personality traits to model personalised user behaviour. PUB dynamically infers user personality from behavioural logs (e.g., ratings, reviews) and item metadata, then generates synthetic interactions that preserve statistical fidelity to real-world data. Experiments on the Amazon review datasets show that logs generated by PUB closely align with real user behaviour and reveal meaningful associations between personality traits and recommendation outcomes.
arXiv Detail & Related papers (2025-06-05T01:57:36Z) - LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z) - A Reinforcement-Learning-Enhanced LLM Framework for Automated A/B Testing in Personalized Marketing [5.250286096386298]
We present a new approach, the RL-LLM-AB test framework, for using reinforcement learning strategy optimization combined with LLMs to automate and personalize A/B tests. The framework is built upon a pre-trained instruction-tuned language model and generates A/B versions of candidate content variants. Numerical results demonstrate the superiority of our proposed RL-LLM-ABTest over existing A/B testing methods.
arXiv Detail & Related papers (2025-05-27T03:31:07Z) - Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems [37.15496324034216]
RecInter is a novel agent-based simulation platform for recommender systems. In RecInter, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time. Merchant Agents can reply, fostering a more realistic and evolving ecosystem.
arXiv Detail & Related papers (2025-05-22T09:14:23Z) - AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents [35.8650712223701]
A/B testing remains constrained by its dependence on large-scale, live traffic from human participants. We present AgentA/B, a novel system that automatically simulates user interaction behaviors with real webpages. Our findings suggest AgentA/B can emulate human-like behavior patterns.
arXiv Detail & Related papers (2025-04-13T21:10:56Z) - MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities on the language side. MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z) - SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities [50.6382396309597]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z) - On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator in recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z) - User Behavior Simulation with Large Language Model based Agents [116.74368915420065]
We propose an LLM-based agent framework and design a sandbox environment to simulate real user behaviors.
Based on extensive experiments, we find that the simulated behaviors of our method are very close to the ones of real humans.
arXiv Detail & Related papers (2023-06-05T02:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.