ChatGPT as a tool for User Story Quality Evaluation: Trustworthy Out of
the Box?
- URL: http://arxiv.org/abs/2306.12132v1
- Date: Wed, 21 Jun 2023 09:26:27 GMT
- Title: ChatGPT as a tool for User Story Quality Evaluation: Trustworthy Out of
the Box?
- Authors: Krishna Ronanki, Beatriz Cabrero-Daniel, and Christian Berger
- Abstract summary: This study explores using ChatGPT for user story quality evaluation and compares its performance with an existing benchmark.
Our study shows that ChatGPT's evaluation aligns well with human evaluation, and we propose a ``best of three'' strategy to improve its output stability.
- Score: 3.6526713965824515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Agile software development, user stories play a vital role in capturing
and conveying end-user needs, prioritizing features, and facilitating
communication and collaboration within development teams. However, automated
methods for evaluating user stories require training in NLP tools and can be
time-consuming to develop and integrate. This study explores using ChatGPT for
user story quality evaluation and compares its performance with an existing
benchmark. Our study shows that ChatGPT's evaluation aligns well with human
evaluation, and we propose a ``best of three'' strategy to improve its output
stability. We also discuss the concept of trustworthiness in AI and its
implications for non-experts using ChatGPT's unprocessed outputs. Our research
contributes to understanding the reliability and applicability of AI in user
story evaluation and offers recommendations for future research.
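The ``best of three'' strategy can be pictured as asking ChatGPT for the same evaluation three times and keeping the majority verdict, which damps run-to-run variability in its output. The sketch below illustrates that reading; the evaluate_once stub, the example user story, and the verdict labels are illustrative assumptions rather than the paper's exact prompts or protocol.

```python
# Minimal sketch of a ``best of three'' strategy: evaluate the same user story
# three times and keep the most frequent verdict. The prompt, the
# evaluate_once() stub, and the quality criterion are assumptions for
# illustration, not the exact setup described in the paper.
from collections import Counter
import random

USER_STORY = "As a customer, I want to reset my password so that I can regain access."

def evaluate_once(story: str) -> str:
    """Stand-in for one ChatGPT call returning a verdict such as 'valuable'
    or 'not valuable'. Replace with a real chat-completion request in practice."""
    return random.choice(["valuable", "valuable", "not valuable"])  # mock output

def best_of_three(story: str) -> str:
    """Query the evaluator three times and return the majority verdict."""
    verdicts = [evaluate_once(story) for _ in range(3)]
    winner, _ = Counter(verdicts).most_common(1)[0]
    return winner

print(best_of_three(USER_STORY))
```

Any chat-completion client can be dropped into evaluate_once; the majority vote over repeated runs is the only part essential to the strategy.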
Related papers
- USeR: A Web-based User Story eReviewer for Assisted Quality Optimizations [2.746265158172294]
Multiple user story quality guidelines exist, but authors like Product Owners in industry projects frequently fail to write high-quality user stories.
This situation is exacerbated by the lack of tools for assessing user story quality.
We propose User Story eReviewer (USeR), a web-based tool that allows authors to determine and optimize user story quality.
arXiv Detail & Related papers (2025-03-03T21:02:10Z)
- User Story Tutor (UST) to Support Agile Software Developers [0.4077787659104315]
We designed, implemented, applied, and evaluated a web application called User Story Tutor (UST).
UST checks the description of a given User Story for readability, and if needed, recommends appropriate practices for improvement.
UST may support the continuing education of agile development teams when writing and reviewing User Stories.
arXiv Detail & Related papers (2024-06-24T01:55:01Z)
- Chatbots Are Not Reliable Text Annotators [0.0]
ChatGPT is a closed-source product which has major drawbacks with regards to transparency, cost, and data protection.
Recent advances in open-source (OS) large language models (LLMs) offer alternatives which remedy these challenges.
arXiv Detail & Related papers (2023-11-09T22:28:14Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- ChatGPT is a Remarkable Tool -- For Experts [9.46644539427004]
We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style.
We highlight the potential risks associated with excessive reliance on ChatGPT in these fields.
We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited.
arXiv Detail & Related papers (2023-06-02T06:28:21Z)
- Exploring the Efficacy of ChatGPT in Analyzing Student Teamwork Feedback with an Existing Taxonomy [0.0]
ChatGPT can achieve over 90% accuracy in labeling student comments.
This study contributes to the growing body of research on the use of AI models in educational contexts.
arXiv Detail & Related papers (2023-05-09T19:55:50Z)
- ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodic evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March 2023 to the present.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three explored approaches (see the sketch after this list).
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective [67.98821225810204]
We evaluate the robustness of ChatGPT from the adversarial and out-of-distribution perspective.
Results show consistent advantages on most adversarial and OOD classification and translation tasks.
ChatGPT shows astounding performance in understanding dialogue-related texts.
arXiv Detail & Related papers (2023-02-22T11:01:20Z)
- FEBR: Expert-Based Recommendation Framework for beneficial and personalized content [77.86290991564829]
We propose FEBR (Expert-Based Recommendation Framework), an apprenticeship learning framework to assess the quality of the recommended content.
The framework exploits the demonstrated trajectories of an expert (assumed to be reliable) in a recommendation evaluation environment, to recover an unknown utility function.
We evaluate the performance of our solution through a user interest simulation environment (using RecSim).
arXiv Detail & Related papers (2021-07-17T18:21:31Z)
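As noted in the reference-free text quality evaluation entry above, the Explicit Score idea reduces to prompting the model for a single number and parsing it out of the reply. The sketch below illustrates that general pattern; the prompt wording, the 1-10 score range, and the parse_score helper are assumptions made for illustration, not the cited paper's exact method.

```python
# Illustrative sketch of an "Explicit Score"-style evaluation: ask a chat model
# for a single numeric quality score, then extract that number from the reply.
# Prompt text, score range, and parsing are assumptions for illustration.
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "Rate the overall quality of the following user story on a scale from 1 to 10. "
    "Reply with the number only.\n\nUser story: {story}"
)

def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from the model's reply, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None

# Example with a hypothetical model reply; in practice the reply would come
# from a chat-completion API call built with PROMPT_TEMPLATE.
reply = "8"
print(parse_score(reply))  # -> 8.0
```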
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.