Related papers: Validating Search Query Simulations: A Taxonomy of Measures

URL: http://arxiv.org/abs/2601.11412v1
Date: Fri, 16 Jan 2026 16:33:25 GMT
Title: Validating Search Query Simulations: A Taxonomy of Measures
Authors: Andreas Konstantin Kruff, Nolwenn Bernard, Philipp Schaer
Abstract summary: We conduct a literature review on methods for the validation of simulated user queries with regard to real queries. Based on the review, we develop a taxonomy that structures the current landscape of available measures. We empirically corroborate the taxonomy by analyzing the relationships between the different measures applied to four different datasets.
Score: 8.19836974395553
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Assessing the validity of user simulators when used for the evaluation of information retrieval systems remains an open question, constraining their effective use and the reliability of simulation-based results. To address this issue, we conduct a comprehensive literature review with a particular focus on methods for the validation of simulated user queries with regard to real queries. Based on the review, we develop a taxonomy that structures the current landscape of available measures. We empirically corroborate the taxonomy by analyzing the relationships between the different measures applied to four different datasets representing diverse search scenarios. Finally, we provide concrete recommendations on which measures or combinations of measures should be considered when validating user simulation in different contexts. Furthermore, we release a dedicated library with the most commonly used measures to facilitate future research.

Related papers

Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce [61.03081096959132]
We propose a context-aware reasoning-enhanced generative search framework for better understanding the complicated context. Our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
arXiv Detail & Related papers (2025-10-19T16:46:11Z)
Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior [58.58249548116766]
We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches and then retraining model checkpoints over that data to test hypotheses relating data to behavior.
arXiv Detail & Related papers (2025-10-16T03:22:48Z)
Evaluating Contrastive Feedback for Effective User Simulations [2.8089969618577997]
This study explores whether the underlying principles of contrastive training techniques can be applied beneficially in the area of prompt engineering for user simulations. The primary objective of this study is to analyze how different modalities of contextual information influence the effectiveness of user simulations.
arXiv Detail & Related papers (2025-05-05T11:02:31Z)
Do Retrieval-Augmented Language Models Adapt to Varying User Needs? [28.729041459278587]
This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems.
arXiv Detail & Related papers (2025-02-27T05:39:38Z)
Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [65.13288661320364]
We introduce our benchmark, Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline. We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z)
Quantifying User Coherence: A Unified Framework for Cross-Domain Recommendation Analysis [69.37718774071793]
This paper introduces novel information-theoretic measures for understanding recommender systems. We evaluate 7 recommendation algorithms across 9 datasets, revealing the relationships between our measures and standard performance metrics.
arXiv Detail & Related papers (2024-10-03T13:02:07Z)
PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines [86.36060279469304]
We introduce PredBench, a benchmark tailored for the holistic evaluation of spatio-temporal prediction networks. This benchmark integrates 12 widely adopted methods with diverse datasets across multiple application domains. Its multi-dimensional evaluation framework broadens the analysis with a comprehensive set of metrics.
arXiv Detail & Related papers (2024-07-11T11:51:36Z)
A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies [6.2511886555343805]
We conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection. We find that SimEvals suggest that all considered explainers are equally performant, and none beat a baseline without explanations.
arXiv Detail & Related papers (2023-02-15T03:27:55Z)
Synthetic Data-Based Simulators for Recommender Systems: A Survey [55.60116686945561]
This survey aims at providing a comprehensive overview of the recent trends in the field of modeling and simulation. We start with the motivation behind the development of frameworks implementing the simulations -- simulators. We provide a new consistent classification of existing simulators based on their functionality, approbation, and industrial effectiveness.
arXiv Detail & Related papers (2022-06-22T19:33:21Z)
Characterizing and comparing external measures for the assessment of cluster analysis and community detection [1.5543116359698947]
Many external evaluation measures have been proposed in the literature to compare two partitions of the same set. This makes the task of selecting the most appropriate measure for a given situation a challenge for the end user. We propose a new empirical evaluation framework to solve this issue and help the end user select an appropriate measure for their application.
arXiv Detail & Related papers (2021-02-01T09:10:25Z)
Metric Learning for Session-based Recommendations [3.706222947143855]
We discuss and compare metric learning approaches to commonly used learning-to-rank methods. We propose a simple architecture for problem analysis and demonstrate that neither extensively big nor deep architectures are necessary.
arXiv Detail & Related papers (2021-01-07T17:51:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.