Related papers: Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting

Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting

URL: http://arxiv.org/abs/2304.12562v2
Date: Wed, 19 Jul 2023 08:28:45 GMT
Title: Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting
Authors: Jianzhang Zhang, Yiyang Chen, Nan Niu, Yinglin Wang, Chuang Liu
Abstract summary: We empirically evaluate ChatGPT's performance on requirements information retrieval tasks. Under zero-shot setting, evaluation results reveal ChatGPT's promising ability to retrieve requirements relevant information.
Score: 12.733403458944972
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, various illustrative examples have shown the impressive ability of generative large language models (LLMs) to perform NLP related tasks. ChatGPT undoubtedly is the most representative model. We empirically evaluate ChatGPT's performance on requirements information retrieval (IR) tasks to derive insights into designing or developing more effective requirements retrieval methods or tools based on generative LLMs. We design an evaluation framework considering four different combinations of two popular IR tasks and two common artifact types. Under zero-shot setting, evaluation results reveal ChatGPT's promising ability to retrieve requirements relevant information (high recall) and limited ability to retrieve more specific requirements information (low precision). Our evaluation of ChatGPT on requirements IR under zero-shot setting provides preliminary evidence for designing or developing more effective requirements IR methods or tools based on LLMs.

Related papers

Do Retrieval-Augmented Language Models Adapt to Varying User Needs? [28.729041459278587]
This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems.
arXiv Detail & Related papers (2025-02-27T05:39:38Z)
Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z)
Strategic Prompting for Conversational Tasks: A Comparative Analysis of Large Language Models Across Diverse Conversational Tasks [21.079199282600907]
We evaluate the capabilities and limitations of five prevalent Large Language Models: Llama, OPT, Falcon, Alpaca, and MPT. The study encompasses various conversational tasks, including reservation, empathetic response generation, mental health and legal counseling, persuasion, and negotiation.
arXiv Detail & Related papers (2024-11-26T08:21:24Z)
A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources. We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation [19.312330150540912]
An emerging application is using Large Language Models (LLMs) to enhance retrieval-augmented generation (RAG) capabilities. We propose FRAMES, a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval.
arXiv Detail & Related papers (2024-09-19T17:52:07Z)
SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance. We introduce SFR-RAG, a small LLM that is instruction-textual with an emphasis on context-grounded generation and hallucination. We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
AI based Multiagent Approach for Requirements Elicitation and Analysis [3.9422957660677476]
This study empirically investigates the effectiveness of utilizing Large Language Models (LLMs) to automate requirements analysis tasks. We deployed four models, namely GPT-3.5, GPT-4 Omni, LLaMA3-70, and Mixtral-8B, and conducted experiments to analyze requirements on four real-world projects. Preliminary results indicate notable variations in task completion among the models.
arXiv Detail & Related papers (2024-08-18T07:23:12Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives [2.3420045370973828]
We present the Benchmark of Information Retrieval (IR) tasks with Complex Objectives (BIRCO) BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives.
arXiv Detail & Related papers (2024-02-21T22:22:30Z)
Zero-shot Item-based Recommendation via Multi-task Product Knowledge Graph Pre-Training [106.85813323510783]
This paper presents a novel paradigm for the Zero-Shot Item-based Recommendation (ZSIR) task. It pre-trains a model on product knowledge graph (PKG) to refine the item features from PLMs. We identify three challenges for pre-training PKG, which are multi-type relations in PKG, semantic divergence between item generic information and relations and domain discrepancy from PKG to downstream ZSIR task.
arXiv Detail & Related papers (2023-05-12T17:38:24Z)
Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks. This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR) To address concerns about data contamination of LLMs, we collect a new test set called NovelEval. To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
Extended High Utility Pattern Mining: An Answer Set Programming Based Framework and Applications [0.0]
Rule-based languages like ASP seem well suited for specifying user-provided criteria to assess pattern utility. We introduce a new framework that allows for new classes of utility criteria not considered in the previous literature. We exploit it as a building block for the definition of an innovative method for predicting ICU admission for COVID-19 patients.
arXiv Detail & Related papers (2023-03-23T11:42:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.