DeepQuali: Initial results of a study on the use of large language models for assessing the quality of user stories
- URL: http://arxiv.org/abs/2602.08887v1
- Date: Mon, 09 Feb 2026 16:49:54 GMT
- Title: DeepQuali: Initial results of a study on the use of large language models for assessing the quality of user stories
- Authors: Adam Trendowicz, Daniel Seifert, Andreas Jedlitschka, Marcus Ciolkowski, Anton Strahilov,
- Abstract summary: Generative artificial intelligence (GAI) is increasingly used in software engineering, mainly for coding tasks. The current focus of using GAI for requirements is on eliciting, transforming, and classifying requirements, not on quality assessment. We propose and evaluate the LLM-based (GPT-4o) approach "DeepQuali" for assessing and improving requirements quality in agile software development.
- Score: 0.40451653578314795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative artificial intelligence (GAI), specifically large language models (LLMs), is increasingly used in software engineering, mainly for coding tasks. However, requirements engineering - particularly requirements validation - has seen limited application of GAI. The current focus of using GAI for requirements is on eliciting, transforming, and classifying requirements, not on quality assessment. We propose and evaluate the LLM-based (GPT-4o) approach "DeepQuali" for assessing and improving requirements quality in agile software development. We applied it to projects in two small companies, where we compared LLM-based quality assessments with expert judgments. Experts also participated in walkthroughs of the solution, provided feedback, and rated their acceptance of the approach. Experts largely agreed with the LLM's quality assessments, especially regarding overall ratings and explanations. However, they did not always agree with one another on detailed ratings, suggesting that expertise and experience may influence judgments. Experts recognized the usefulness of the approach but criticized the lack of integration into their workflow. LLMs show potential in supporting software engineers with the quality assessment and improvement of requirements. The explicit use of quality models and explanatory feedback increases acceptance.
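The abstract does not disclose DeepQuali's actual quality model or prompts, but the core loop it describes (rating a user story against an explicit quality model and returning explanatory feedback) can be sketched. The following is a minimal illustration assuming the OpenAI chat completions API; the INVEST-style criteria, prompt wording, and helper names are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of an LLM-based user story quality check in the spirit of
# DeepQuali. Assumptions: the OpenAI chat completions API is available and
# OPENAI_API_KEY is set; the INVEST-style criteria and prompt wording below
# are illustrative, not the paper's actual quality model.
from openai import OpenAI

client = OpenAI()

# Illustrative quality criteria; the paper uses an explicit quality model,
# but its exact contents are not given in the abstract.
CRITERIA = [
    "Well-formed: follows 'As a <role>, I want <goal>, so that <benefit>'",
    "Atomic: expresses exactly one requirement",
    "Unambiguous: avoids vague terms such as 'fast' or 'user-friendly'",
    "Estimable: concrete enough for the team to estimate effort",
]


def assess_user_story(story: str) -> str:
    """Ask the model for a per-criterion rating plus a short explanation.

    Returning explanations alongside ratings mirrors the paper's finding
    that explanatory feedback increases expert acceptance.
    """
    criteria_text = "\n".join(f"- {c}" for c in CRITERIA)
    prompt = (
        "Assess the following agile user story against each quality criterion. "
        "For each criterion, give a rating from 1 (poor) to 5 (excellent), "
        "a one-sentence justification, and, where needed, a suggested rewrite.\n\n"
        f"Criteria:\n{criteria_text}\n\nUser story:\n{story}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep assessments as repeatable as possible
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(assess_user_story(
        "As a shop-floor operator, I want to see machine alarms on my tablet, "
        "so that I can react before production stops."
    ))
```

Given the experts' criticism about workflow integration, such a check would more plausibly run as a hook in the team's issue tracker than as a standalone script.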
Related papers
- Applying a Requirements-Focused Agile Management Approach for Machine Learning-Enabled Systems [1.3704574906282525]
Machine Learning (ML)-enabled systems challenge traditional Requirements Engineering (RE) and agile management. Existing RE and agile practices remain poorly integrated and insufficiently tailored to the characteristics of such systems. This paper reports on the practical experience of applying RefineML, a requirements-focused approach for the continuous and agile refinement of ML-enabled systems.
arXiv Detail & Related papers (2026-02-04T20:49:02Z)
- Practitioner Insights on Fairness Requirements in the AI Development Life Cycle: An Interview Study [3.5429774642987915]
We conducted research on fairness requirements in AI from a software engineering perspective. Our study assesses the participants' awareness of fairness in AI/ML software and its application within the Software Development Life Cycle (SDLC). Findings show that while our participants recognize AI fairness dimensions, practices are inconsistent, and fairness is often deprioritized, with noticeable knowledge gaps.
arXiv Detail & Related papers (2025-12-15T19:12:34Z)
- "You Are Rejected!": An Empirical Study of Large Language Models Taking Hiring Evaluations [1.1254231171451319]
This paper investigates whether Large Language Models (LLMs) can pass hiring evaluations. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to the expectation that LLMs make ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions.
arXiv Detail & Related papers (2025-10-22T01:59:30Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- Multi-Modal Requirements Data-based Acceptance Criteria Generation using LLMs [17.373348983049176]
We propose RAGcceptance M2RE, a novel approach to generate acceptance criteria from multi-modal requirements data. We show that our approach effectively reduces manual effort, captures nuanced stakeholder intent, and provides valuable criteria. This research underscores the potential of multi-modal RAG techniques in streamlining software validation processes and improving development efficiency.
arXiv Detail & Related papers (2025-08-09T08:35:40Z)
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Rethinking Machine Unlearning in Image Generation Models [59.697750585491264]
We introduce CatIGMU, a novel hierarchical task categorization framework, and EvalIGMU, a comprehensive evaluation framework. We also construct DataIGM, a high-quality unlearning dataset.
arXiv Detail & Related papers (2025-06-03T11:25:14Z)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements. It is capable of processing both direct assessment and pairwise ranking formats, grouped with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
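As a rough reading of the calibration idea in the AutoCalibrate entry (no gradients; alignment is achieved by selecting the evaluation criteria that agree best with a set of human labels), the sketch below compresses the paper's multi-stage procedure into a single selection step. The function names and the dummy `llm_score` stand-in are hypothetical, not AutoCalibrate's actual interface.

```python
# Schematic sketch of gradient-free evaluator calibration as described in the
# AutoCalibrate abstract: draft candidate scoring criteria, score a small
# human-labeled sample under each, and keep the candidate whose scores
# correlate best with the human labels. All names here are hypothetical.
from statistics import correlation  # Pearson's r, Python 3.10+


def llm_score(text: str, criteria: str) -> float:
    """Stand-in for an LLM call that rates `text` under `criteria` (1-5).

    A real implementation would prompt an evaluator LLM; this dummy value
    only keeps the sketch self-contained and runnable.
    """
    return float((len(text) + len(criteria)) % 5 + 1)  # placeholder score


def calibrate(candidate_criteria: list[str],
              samples: list[str],
              human_labels: list[float]) -> str:
    """Return the candidate criteria most aligned with human judgments."""
    best_criteria, best_corr = candidate_criteria[0], float("-inf")
    for criteria in candidate_criteria:
        scores = [llm_score(sample, criteria) for sample in samples]
        corr = correlation(scores, human_labels)  # agreement with humans
        if corr > best_corr:
            best_criteria, best_corr = criteria, corr
    return best_criteria
```

In the paper's setting the candidate criteria would themselves be drafted and revised by an LLM over several stages; the selection-by-correlation step is the part the abstract makes explicit.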
This list is automatically generated from the titles and abstracts of the papers on this site.