Related papers: ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

URL: http://arxiv.org/abs/2602.16938v1
Date: Wed, 18 Feb 2026 23:00:21 GMT
Title: ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Authors: Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier,
Abstract summary: We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.<n>Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation.<n>We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation.
Score: 48.83868690303791
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

Related papers

Personas within Parameters: Fine-Tuning Small Language Models with Low-Rank Adapters to Mimic User Behaviors [1.8352113484137629]
A long-standing challenge in developing accurate recommendation models is simulating user behavior, mainly due to the complex nature of user interactions.<n>We propose an approach to extracting robust user representations using a frozen Large Language Models (LLMs) and simulating cost-effective, resource-efficient user agents powered by fine-tuned Small Language Models (SLMs)<n>Our experiments provide compelling empirical evidence of the efficacy of our methods, demonstrating that user agents developed using our approach have the potential to bridge the gap between offline metrics and real-world performance of recommender systems.
arXiv Detail & Related papers (2025-08-18T22:14:57Z)
PUB: An LLM-Enhanced Personality-Driven User Behaviour Simulator for Recommender System Evaluation [9.841963696576546]
Personality-driven User Behaviour Simulator (PUB) integrates the Big Five personality traits to model personalised user behaviour.<n>PUB dynamically infers user personality from behavioural logs (e.g., ratings, reviews) and item metadata, then generates synthetic interactions that preserve statistical fidelity to real-world data.<n> Experiments on the Amazon review datasets show that logs generated by PUB closely align with real user behaviour and reveal meaningful associations between personality traits and recommendation outcomes.
arXiv Detail & Related papers (2025-06-05T01:57:36Z)
Search-Based Interaction For Conversation Recommendation via Generative Reward Model Based Simulated User [117.82681846559909]
Conversational recommendation systems (CRSs) use multi-turn interaction to capture user preferences and provide personalized recommendations.<n>We propose a generative reward model based simulated user, named GRSU, for automatic interaction with CRSs.
arXiv Detail & Related papers (2025-04-29T06:37:30Z)
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles [37.43150003866563]
We introduce the User Simulator with Implicit Profiles (USP), a framework that infers implicit user profiles from human-machine interactions to simulate personalized and realistic dialogues.<n>USP outperforms strong baselines in terms of authenticity and diversity while maintaining comparable consistency.
arXiv Detail & Related papers (2025-02-26T09:26:54Z)
Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by'matching', which enables implicit guidance without training a discriminator. We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z)
How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation [14.646529557978512]
We analyze the limitations of using Large Language Models in constructing user simulators for Conversational Recommender System. Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results. We propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items.
arXiv Detail & Related papers (2024-03-25T04:21:06Z)
Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems [2.788542465279969]
This paper introduces DAUS, a Domain-Aware User Simulator. We fine-tune DAUS on real examples of task-oriented dialogues. Results on two relevant benchmarks showcase significant improvements in terms of user goal fulfillment.
arXiv Detail & Related papers (2024-02-20T20:57:47Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems ( TDSs) are assessed mainly in an offline setting or through human evaluation. We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates user's analogical thinking in interactions with systems. We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z)
AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective. We demonstrate that the proposed method finds the optimal data distribution faster (up to $50times$), with significantly reduced training data generation (up to $30times$) and better accuracy ($+8.7%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.