Empirical Evaluation of ChatGPT on Requirements Information Retrieval
Under Zero-Shot Setting
- URL: http://arxiv.org/abs/2304.12562v2
- Date: Wed, 19 Jul 2023 08:28:45 GMT
- Title: Empirical Evaluation of ChatGPT on Requirements Information Retrieval
Under Zero-Shot Setting
- Authors: Jianzhang Zhang, Yiyang Chen, Nan Niu, Yinglin Wang, Chuang Liu
- Abstract summary: We empirically evaluate ChatGPT's performance on requirements information retrieval tasks.
Under the zero-shot setting, the evaluation results reveal ChatGPT's promising ability to retrieve requirements-relevant information.
- Score: 12.733403458944972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, various illustrative examples have shown the impressive ability of
generative large language models (LLMs) to perform NLP-related tasks. ChatGPT
undoubtedly is the most representative model. We empirically evaluate ChatGPT's
performance on requirements information retrieval (IR) tasks to derive insights
into designing or developing more effective requirements retrieval methods or
tools based on generative LLMs. We design an evaluation framework considering
four different combinations of two popular IR tasks and two common artifact
types. Under the zero-shot setting, evaluation results reveal ChatGPT's promising
ability to retrieve requirements-relevant information (high recall) and limited
ability to retrieve more specific requirements information (low precision). Our
evaluation of ChatGPT on requirements IR under the zero-shot setting provides
preliminary evidence for designing or developing more effective requirements IR
methods or tools based on LLMs.
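To make the "high recall, low precision" finding concrete, here is a minimal sketch of how the two measures are computed for a single retrieval query. The requirement IDs and the counts are hypothetical and are not taken from the paper; the helper name `precision_recall` is likewise an assumption made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # items that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 truly relevant requirements, 10 items returned.
relevant = {"R1", "R2", "R3", "R4"}
retrieved = {"R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8", "R9", "R10"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.40 recall=1.00
```

In this made-up case every relevant requirement is found (recall 1.0), but most of what is returned is not relevant (precision 0.4), which is the pattern the abstract describes.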
Related papers
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- RepEval: Effective Text Evaluation with LLM Representation [54.07909112633993]
We introduce RepEval, the first metric leveraging the projection of LLM representations for evaluation.
RepEval requires minimal sample pairs for training, and through simple prompt modifications, it can easily transition to various tasks.
Results on ten datasets from three tasks demonstrate the high effectiveness of our method.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Model Generation with LLMs: From Requirements to UML Sequence Diagrams [9.114284818139069]
This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., sequence diagrams, from NL requirements.
We examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains.
Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges.
arXiv Detail & Related papers (2024-04-09T15:07:25Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives [2.3420045370973828]
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO).
BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives.
arXiv Detail & Related papers (2024-02-21T22:22:30Z)
- The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected or different from traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)
- Fine-tuning and aligning question answering models for complex information extraction tasks [0.8392546351624164]
Extractive language models like question answering (QA) or passage retrieval models guarantee that query results are found within the boundaries of a corresponding context document.
We show that fine-tuning existing German QA models boosts performance for tailored extraction tasks of complex linguistic features.
We deduce a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic the assessment criteria from human experts.
arXiv Detail & Related papers (2023-09-26T10:02:21Z)
- Zero-shot Item-based Recommendation via Multi-task Product Knowledge Graph Pre-Training [106.85813323510783]
This paper presents a novel paradigm for the Zero-Shot Item-based Recommendation (ZSIR) task.
It pre-trains a model on a product knowledge graph (PKG) to refine the item features from PLMs.
We identify three challenges for pre-training on the PKG: multi-type relations in the PKG, semantic divergence between item generic information and relations, and domain discrepancy from the PKG to the downstream ZSIR task.
arXiv Detail & Related papers (2023-05-12T17:38:24Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- GPT4Rec: A Generative Framework for Personalized Recommendation and User Interests Interpretation [8.293646972329581]
GPT4Rec is a novel and flexible generative framework inspired by search engines.
It first generates hypothetical "search queries" given item titles in a user's history, and then retrieves items for recommendation by searching these queries.
Our framework outperforms state-of-the-art methods by 75.7% and 22.2% in Recall@K on two public datasets.
arXiv Detail & Related papers (2023-04-08T00:30:08Z)
- Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications [0.0]
Rule-based languages like ASP seem well suited for specifying user-provided criteria to assess pattern utility.
We introduce a new framework that allows for new classes of utility criteria not considered in the previous literature.
We exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients.
arXiv Detail & Related papers (2023-03-23T11:42:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.