ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation
- URL: http://arxiv.org/abs/2602.18306v1
- Date: Fri, 20 Feb 2026 16:02:13 GMT
- Title: ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation
- Authors: Dongming Jin, Zhi Jin, Zheng Fang, Linyu Li, XiaoTian Yang, Yuanpeng He, Xiaohong Chen
- Abstract summary: The bottleneck of automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains largely underexplored. We propose ReqElicitGym, an interactive and automatic evaluation environment for assessing interview competence in conversational requirements elicitation.
- Score: 36.77382403204434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains largely underexplored. Existing evaluations often depend on a handful of scenarios, real user interaction, and subjective human scoring, which hinders systematic and quantitative comparison. To address these challenges, we propose ReqElicitGym, an interactive and automatic evaluation environment for assessing interview competence in conversational requirements elicitation. Specifically, ReqElicitGym introduces a new evaluation dataset and designs both an interactive oracle user and a task evaluator. The dataset contains 101 website requirements elicitation scenarios spanning 10 application types. Both the oracle user and the task evaluator achieve high agreement with real users and expert judgment. Using ReqElicitGym, any automated conversational requirements elicitation approach (e.g., LLM-based agents) can be evaluated in a reproducible and quantitative manner through interaction with the environment. Based on ReqElicitGym, we conduct a systematic empirical study of seven representative LLMs; the results show that current LLMs still exhibit limited interview competence in uncovering implicit requirements. In particular, they elicit fewer than half of the users' implicit requirements, and their effective elicitation questions often emerge only in later turns of the dialogue. In addition, we found that LLMs can elicit interaction- and content-related implicit requirements but consistently struggle with style-related ones. We believe ReqElicitGym will facilitate the evaluation and development of automated conversational requirements elicitation.
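The abstract describes a gym-style protocol: an interviewer agent questions a simulated oracle user over multiple turns, and a task evaluator then scores how many ground-truth implicit requirements the dialogue uncovered. The sketch below illustrates that loop in Python; all class names, scenario fields, and the exact-match "evaluator" are illustrative assumptions, since the paper's actual interfaces and its (LLM-based) oracle user and task evaluator are not specified here.

```python
# Minimal sketch of a gym-style interview-evaluation loop, under assumed
# names. The string-match scoring below is a stand-in for the paper's task
# evaluator, which is LLM-based and agrees highly with expert judgment.
from dataclasses import dataclass


@dataclass
class Scenario:
    app_type: str                      # one of the 10 website application types
    explicit_brief: str                # the user's initial, explicit request
    implicit_requirements: list[str]   # hidden ground truth to be uncovered


class OracleUser:
    """Simulated user that reveals hidden requirements only when asked."""

    def __init__(self, scenario: Scenario):
        self.scenario = scenario

    def initial_brief(self) -> str:
        return self.scenario.explicit_brief

    def respond(self, question: str) -> str:
        # A real oracle grounds its answer in the hidden requirements;
        # here the behavior is stubbed out.
        return f"(answer grounded in hidden requirements for: {question!r})"


def evaluate_interviewer(agent, scenario: Scenario, max_turns: int = 10) -> float:
    """Run one interview; return the fraction of implicit requirements elicited."""
    user = OracleUser(scenario)
    transcript = [("user", user.initial_brief())]
    for _ in range(max_turns):
        question = agent.ask(transcript)   # agent chooses the next question
        answer = user.respond(question)
        transcript += [("agent", question), ("user", answer)]
    dialogue = " ".join(text for _, text in transcript)
    elicited = [r for r in scenario.implicit_requirements if r in dialogue]
    return len(elicited) / len(scenario.implicit_requirements)
```

Any interviewer exposing an ask(transcript) method that returns the next question can be dropped into such a loop, which is what makes turn-level findings (e.g., that effective elicitation questions emerge in later turns) straightforward to measure.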
Related papers
- Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
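The reward idea summarized above, crediting only questions that surface genuinely new implicit information, can be made concrete with a toy scoring function. The sketch below is a hypothetical illustration under the assumption that elicited facts are canonicalized to strings; it is not the paper's actual reward.

```python
# Toy information-gain reward: credit only facts that are (a) in the hidden
# ground truth and (b) not already elicited earlier in the dialogue.
# Hypothetical sketch; the paper's reinforcement-finetuning reward differs.
def info_gain_reward(answer_facts: set[str],
                     already_elicited: set[str],
                     ground_truth: set[str]) -> int:
    new_implicit = (answer_facts & ground_truth) - already_elicited
    already_elicited |= new_implicit   # repeats earn nothing on later turns
    return len(new_implicit)
```

Restating known information scores zero under this scheme, which steers finetuning toward questions that probe unexplored parts of the user's intent.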
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
- LLMREI: Automating Requirements Elicitation Interviews with LLMs [47.032121951473435]
This study introduces LLMREI, a chatbot designed to conduct requirements elicitation interviews with minimal human intervention. We evaluated its performance in 33 simulated stakeholder interviews. Our findings indicate that LLMREI makes a similar number of errors compared to human interviewers, is capable of extracting a large portion of requirements, and demonstrates a notable ability to generate highly context-dependent questions.
arXiv Detail & Related papers (2025-07-03T12:18:05Z)
- EvalAgent: Discovering Implicit Evaluation Criteria from the Web [82.82096383262068]
We introduce EvalAgent, a framework designed to automatically uncover nuanced and task-specific criteria. EvalAgent mines expert-authored online guidance to propose diverse, long-tail evaluation criteria. Our experiments demonstrate that the grounded criteria produced by EvalAgent are often implicit, yet specific.
arXiv Detail & Related papers (2025-04-21T16:43:50Z)
- Using Large Language Models to Develop Requirements Elicitation Skills [1.1473376666000734]
We propose conditioning a large language model to play the role of the client during a chat-based interview. We find that both approaches provide sufficient information for participants to construct technically sound solutions.
arXiv Detail & Related papers (2025-03-10T19:27:38Z)
- Dynamic benchmarking framework for LLM-based conversational data capture [0.0]
This paper introduces a benchmarking framework to assess large language models (LLMs). It integrates generative agent simulation to evaluate performance on key dimensions: information extraction, context awareness, and adaptive engagement. Results show that adaptive strategies improve data extraction accuracy, especially when handling ambiguous responses.
arXiv Detail & Related papers (2025-02-04T15:47:47Z)
- RECOVER: Toward Requirements Generation from Stakeholders' Conversations [10.706772429994384]
This paper introduces RECOVER, a novel conversational requirements engineering approach. It supports practitioners in automatically extracting system requirements from stakeholder interactions. Empirical evaluation shows promising performance, with generated requirements demonstrating satisfactory correctness, completeness, and actionability.
arXiv Detail & Related papers (2024-11-29T08:52:40Z)
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues [8.036117602566074]
External retrieval mechanisms are often employed to enhance the quality of augmented generations in dialogues. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. We introduce RAD-Bench, a benchmark designed to evaluate LLMs' capabilities in multi-turn dialogues following retrievals.
arXiv Detail & Related papers (2024-09-19T08:26:45Z)
- Elicitron: An LLM Agent-Based Simulation Framework for Design Requirements Elicitation [38.98478510165569]
This paper introduces a novel framework that leverages Large Language Models (LLMs) to automate and enhance the requirements elicitation process.
LLMs are used to generate a vast array of simulated users (LLM agents), enabling the exploration of a much broader range of user needs.
arXiv Detail & Related papers (2024-04-04T17:36:29Z)
- Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models [115.7508325840751]
The recent success of large language models (LLMs) has shown great potential to develop more powerful conversational recommender systems (CRSs).
In this paper, we embark on an investigation into the utilization of ChatGPT for conversational recommendation, revealing the inadequacy of the existing evaluation protocol.
We propose an interactive evaluation approach based on LLMs, named iEvaLM, that harnesses LLM-based user simulators.
arXiv Detail & Related papers (2023-05-22T15:12:43Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)