Sim4IA-Bench: A User Simulation Benchmark Suite for Next Query and Utterance Prediction
- URL: http://arxiv.org/abs/2511.09329v1
- Date: Thu, 13 Nov 2025 01:46:50 GMT
- Title: Sim4IA-Bench: A User Simulation Benchmark Suite for Next Query and Utterance Prediction
- Authors: Andreas Konstantin Kruff, Christin Katharina Kreutz, Timo Breuer, Philipp Schaer, Krisztian Balog
- Abstract summary: We present Sim4IA-Bench, a simulation benchmark suite for the prediction of the next queries and utterances.
Our dataset comprises 160 real-world search sessions from the CORE search engine.
Sim4IA-Bench provides a basis for evaluating and comparing user simulation approaches.
- Score: 18.30483927706278
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Validating user simulation is a difficult task due to the lack of established measures and benchmarks, which makes it challenging to assess whether a simulator accurately reflects real user behavior. As part of the Sim4IA Micro-Shared Task at the Sim4IA Workshop, SIGIR 2025, we present Sim4IA-Bench, a simulation benchmark suite for the prediction of the next queries and utterances, the first of its kind in the IR community. Our dataset as part of the suite comprises 160 real-world search sessions from the CORE search engine. For 70 of these sessions, up to 62 simulator runs are available, divided into Task A and Task B, in which different approaches predicted users' next search queries or utterances. Sim4IA-Bench provides a basis for evaluating and comparing user simulation approaches and for developing new measures of simulator validity. Although modest in size, the suite represents the first publicly available benchmark that links real search sessions with simulated next-query predictions. In addition to serving as a testbed for next query prediction, it also enables exploratory studies on query reformulation behavior, intent drift, and interaction-aware retrieval evaluation. We also introduce a new measure for evaluating next-query predictions in this task. By making the suite publicly available, we aim to promote reproducible research and stimulate further work on realistic and explainable user simulation for information access: https://github.com/irgroup/Sim4IA-Bench.
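The abstract mentions a new measure for evaluating next-query predictions but does not describe it. As a minimal illustrative sketch only (not the paper's actual measure), a predicted next query can be scored against the query a real user actually issued with token-level Jaccard overlap, taking the best candidate per session step; all function names here are hypothetical:

```python
# Hypothetical sketch of scoring next-query predictions against observed
# user queries; this is NOT the measure introduced in Sim4IA-Bench.

def jaccard_next_query(predicted: str, actual: str) -> float:
    """Token-level Jaccard similarity between a predicted and an observed query."""
    pred_tokens = set(predicted.lower().split())
    actual_tokens = set(actual.lower().split())
    if not pred_tokens and not actual_tokens:
        return 1.0
    return len(pred_tokens & actual_tokens) / len(pred_tokens | actual_tokens)


def score_session_step(predictions: list[str], actual_next: str) -> float:
    """Best score across a simulator's candidate predictions for one session step."""
    return max(jaccard_next_query(p, actual_next) for p in predictions)
```

A benchmark run would average such per-step scores over all sessions; more realistic measures would likely also account for semantic similarity rather than exact token overlap.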
Related papers
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors [58.87134689752605]
We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation.
We show that even the best LLMs today have limited simulation ability (score: 40.80/100), and that performance scales log-linearly with model size.
We demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
arXiv Detail & Related papers (2025-10-20T13:14:38Z)
- Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation [18.225151370273093]
This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs).
We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS).
We curate a comprehensive benchmark suite, LLM-S3 (Large Language Model-based Sociodemographic Simulation Survey), that spans 11 real-world public datasets across four sociological domains.
arXiv Detail & Related papers (2025-09-08T04:59:00Z)
- YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models [50.35333054932747]
We introduce a novel social simulator called YuLan-OneSim.
Users can simply describe and refine their simulation scenarios through natural language interactions with our simulator.
We implement 50 default simulation scenarios spanning 8 domains, including economics, sociology, politics, psychology, organization, demographics, law, and communication.
arXiv Detail & Related papers (2025-05-12T14:05:17Z) - Exploring Human-Like Thinking in Search Simulations with Large Language Models [9.825091149361208]
Simulating user search behavior is a critical task in information retrieval.
Recent advancements in large language models (LLMs) have opened up new possibilities for generating human-like actions.
We explore the integration of human-like thinking into search simulations by leveraging LLMs to simulate users' hidden cognitive processes.
arXiv Detail & Related papers (2025-04-10T09:04:58Z)
- BeSimulator: A Large Language Model Powered Text-based Behavior Simulator [18.318419980796012]
We propose BeSimulator as an attempt towards behavior simulation in the context of text-based environments.
BeSimulator can generalize across scenarios and achieve long-horizon complex simulation.
Our experiments show a significant performance improvement in behavior simulation compared to baselines.
arXiv Detail & Related papers (2024-09-24T08:37:04Z)
- NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking [65.24988062003096]
We present NAVSIM, a framework for benchmarking vision-based driving policies.
Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other.
NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights.
arXiv Detail & Related papers (2024-06-21T17:59:02Z)
- USimAgent: Large Language Models for Simulating Search Users [33.17004578463697]
We introduce a Large Language Models-based user search behavior simulator, USimAgent.
The simulator can simulate users' querying, clicking, and stopping behaviors during search.
Empirical investigation on a real user behavior dataset shows that the simulator outperforms existing methods in query generation.
arXiv Detail & Related papers (2024-03-14T07:40:54Z)
- BASES: Large-scale Web Search User Simulation with Large Language Model based Agents [108.97507653131917]
BASES is a novel user simulation framework built on large language models (LLMs).
Our simulation framework can generate unique user profiles at scale, which subsequently leads to diverse search behaviors.
WARRIORS is a new large-scale dataset encompassing web search user behaviors, including both Chinese and English versions.
arXiv Detail & Related papers (2024-02-27T13:44:09Z)
- Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation.
We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates users' analogical thinking in interactions with systems.
We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z)
- A User's Guide to Calibrating Robotics Simulators [54.85241102329546]
This paper proposes a set of benchmarks and a framework for the study of various algorithms aimed to transfer models and policies learnt in simulation to the real world.
We conduct experiments on a wide range of well known simulated environments to characterize and offer insights into the performance of different algorithms.
Our analysis can be useful for practitioners working in this area and can help make informed choices about the behavior and main properties of sim-to-real algorithms.
arXiv Detail & Related papers (2020-11-17T22:24:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.