Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting
- URL: http://arxiv.org/abs/2304.12562v2
- Date: Wed, 19 Jul 2023 08:28:45 GMT
- Title: Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting
- Authors: Jianzhang Zhang, Yiyang Chen, Nan Niu, Yinglin Wang, Chuang Liu
- Abstract summary: We empirically evaluate ChatGPT's performance on requirements information retrieval tasks. Under a zero-shot setting, the evaluation results reveal ChatGPT's promising ability to retrieve requirements-relevant information.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, various illustrative examples have shown the impressive ability of
generative large language models (LLMs) to perform NLP-related tasks. ChatGPT
is undoubtedly the most representative model. We empirically evaluate ChatGPT's
performance on requirements information retrieval (IR) tasks to derive insights
into designing or developing more effective requirements retrieval methods or
tools based on generative LLMs. We design an evaluation framework covering the
four combinations of two popular IR tasks and two common artifact types. Under
a zero-shot setting, the evaluation results reveal ChatGPT's promising ability
to retrieve requirements-relevant information (high recall) and its limited
ability to retrieve more specific requirements information (low precision). Our
zero-shot evaluation of ChatGPT on requirements IR provides preliminary
evidence for designing or developing more effective LLM-based requirements IR
methods or tools.
Related papers
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation [19.312330150540912]
An emerging application is using Large Language Models (LLMs) to enhance retrieval-augmented generation (RAG) capabilities.
We propose FRAMES, a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses.
We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval.
arXiv Detail & Related papers (2024-09-19T17:52:07Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and minimizing hallucination.
We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- AI based Multiagent Approach for Requirements Elicitation and Analysis [3.9422957660677476]
This study empirically investigates the effectiveness of utilizing Large Language Models (LLMs) to automate requirements analysis tasks.
We deployed four models, namely GPT-3.5, GPT-4 Omni, LLaMA3-70, and Mixtral-8B, and conducted experiments to analyze requirements on four real-world projects.
Preliminary results indicate notable variations in task completion among the models.
arXiv Detail & Related papers (2024-08-18T07:23:12Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Model Generation with LLMs: From Requirements to UML Sequence Diagrams [9.114284818139069]
This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., sequence diagrams, from NL requirements.
We examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains.
Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges.
arXiv Detail & Related papers (2024-04-09T15:07:25Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives [2.3420045370973828]
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO).
BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives.
arXiv Detail & Related papers (2024-02-21T22:22:30Z)
- Zero-shot Item-based Recommendation via Multi-task Product Knowledge Graph Pre-Training [106.85813323510783]
This paper presents a novel paradigm for the Zero-Shot Item-based Recommendation (ZSIR) task.
It pre-trains a model on product knowledge graph (PKG) to refine the item features from PLMs.
We identify three challenges for pre-training PKG: multi-type relations in PKG, semantic divergence between item generic information and relations, and domain discrepancy from PKG to the downstream ZSIR task.
arXiv Detail & Related papers (2023-05-12T17:38:24Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications [0.0]
Rule-based languages like ASP seem well suited for specifying user-provided criteria to assess pattern utility.
We introduce a new framework that allows for new classes of utility criteria not considered in the previous literature.
We exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients.
arXiv Detail & Related papers (2023-03-23T11:42:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.