Using LLMs in Software Requirements Specifications: An Empirical Evaluation
- URL: http://arxiv.org/abs/2404.17842v1
- Date: Sat, 27 Apr 2024 09:37:00 GMT
- Title: Using LLMs in Software Requirements Specifications: An Empirical Evaluation
- Authors: Madhava Krishna, Bhagesh Gaur, Arsh Verma, Pankaj Jalote
- Abstract summary: We assess the performance of GPT-4 and CodeLlama in drafting a Software Requirements Specification (SRS).
Our results suggest that LLMs can match the output quality of an entry-level software engineer in generating an SRS.
We conclude that LLMs can be gainfully used by software engineers to increase productivity.
- Score: 0.2812395851874055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The creation of a Software Requirements Specification (SRS) document is important for any software development project. Given the recent prowess of Large Language Models (LLMs) in answering natural language queries and generating sophisticated textual outputs, our study explores their capability to produce accurate, coherent, and structured drafts of these documents to accelerate the software development lifecycle. We assess the performance of GPT-4 and CodeLlama in drafting an SRS for a university club management system and compare it against human benchmarks using eight distinct criteria. Our results suggest that LLMs can match the output quality of an entry-level software engineer in generating an SRS, delivering complete and consistent drafts. We also evaluate the capabilities of LLMs to identify and rectify problems in a given requirements document. Our experiments indicate that GPT-4 is capable of identifying issues and giving constructive feedback for rectifying them, while CodeLlama's results for validation were not as encouraging. We repeated the generation exercise for four distinct use cases to study the time saved by employing LLMs for SRS generation. The experiment demonstrates that LLMs may facilitate a significant reduction in development time for entry-level software engineers. Hence, we conclude that LLMs can be gainfully used by software engineers to increase productivity by saving time and effort in generating, validating, and rectifying software requirements.
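The generation workflow described in the abstract amounts to prompting a chat model, section by section, with a structured request for an SRS draft. Below is a minimal sketch of how such a draft could be produced, assuming the OpenAI Python client; the model name, system prompt, section list, and project brief are illustrative assumptions, not the paper's exact setup, and serving CodeLlama behind a compatible API is likewise only a possibility noted in a comment.

```python
# Minimal sketch: ask a chat LLM to draft an SRS for a university club
# management system. Model name, prompts, and section list are illustrative
# assumptions, not the paper's exact experimental setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SECTIONS = [
    "Introduction",
    "Overall Description",
    "Functional Requirements",
    "Non-Functional Requirements",
    "Use Cases",
]

project_brief = (
    "A university club management system that lets students create clubs, "
    "manage memberships, schedule events, and track budgets."
)

def draft_srs(brief: str) -> str:
    """Return an SRS draft, generating one section at a time."""
    draft = []
    for section in SECTIONS:
        response = client.chat.completions.create(
            model="gpt-4",  # CodeLlama could be served behind a compatible API instead
            messages=[
                {"role": "system",
                 "content": "You are a requirements engineer writing an IEEE-style SRS."},
                {"role": "user",
                 "content": f"Project: {brief}\n\nWrite the '{section}' section of the SRS."},
            ],
            temperature=0.2,  # keep the draft consistent across sections
        )
        draft.append(f"## {section}\n{response.choices[0].message.content}")
    return "\n\n".join(draft)

if __name__ == "__main__":
    print(draft_srs(project_brief))
```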
Related papers
- Studying and Benchmarking Large Language Models For Log Level Suggestion [49.176736212364496]
Large Language Models (LLMs) have become a focal point of research across various domains.
This paper investigates the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion.
arXiv Detail & Related papers (2024-10-11T03:52:17Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Requirements are All You Need: From Requirements to Code with LLMs [0.0]
Large language models (LLMs) can be applied to software engineering tasks.
This paper introduces a tailored LLM for automating the generation of code snippets from well-structured requirements documents.
We demonstrate the LLM's proficiency in comprehending intricate user requirements and producing robust design and code solutions.
arXiv Detail & Related papers (2024-06-14T14:57:35Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming [12.355284125578342]
Large Language Models (LLMs) have become a focal point in modern software development.
LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants.
However, each system requires the LLM to be honed to its set of workspaces to ensure the best performance.
arXiv Detail & Related papers (2024-02-22T03:51:34Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Towards an Understanding of Large Language Models in Software Engineering Tasks [29.30433406449331]
Large Language Models (LLMs) have drawn widespread attention and research due to their astounding performance in text generation and reasoning tasks.
The evaluation and optimization of LLMs in software engineering tasks, such as code generation, have become a research focus.
This paper comprehensively investigates and collates the research and products combining LLMs with software engineering.
arXiv Detail & Related papers (2023-08-22T12:37:29Z)
- Software Testing with Large Language Models: Survey, Landscape, and Vision [32.34617250991638]
Pre-trained large language models (LLMs) have emerged as a breakthrough technology in natural language processing and artificial intelligence.
This paper provides a comprehensive review of the utilization of LLMs in software testing.
arXiv Detail & Related papers (2023-07-14T08:26:12Z)