On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability
- URL: http://arxiv.org/abs/2501.04810v1
- Date: Wed, 08 Jan 2025 19:54:31 GMT
- Title: On the Impact of Requirements Smells in Prompts: The Case of Automated Traceability
- Authors: Andreas Vogelsang, Alexander Korn, Giovanna Broccia, Alessio Ferrari, Jannik Fischbach, Chetan Arora
- Abstract summary: We investigate the role of requirements smells (indicators of potential issues like ambiguity and inconsistency) when used in prompts for large language models (LLMs).
Our results show mixed outcomes: while requirements smells had a small but significant effect when predicting whether a requirement was implemented in a piece of code (i.e., a trace link exists), no significant effect was observed when tracing the requirements with the associated lines of code.
These findings suggest that requirements smells can affect LLM performance in certain SE tasks but may not uniformly impact all tasks.
- Score: 45.24937784556523
- Abstract: Large language models (LLMs) are increasingly used to generate software artifacts, such as source code, tests, and trace links. Requirements play a central role in shaping the input prompts that guide LLMs, as they are often used as part of the prompts to synthesize the artifacts. However, the impact of requirements formulation on LLM performance remains unclear. In this paper, we investigate the role of requirements smells (indicators of potential issues like ambiguity and inconsistency) when used in prompts for LLMs. We conducted experiments using two LLMs focusing on automated trace link generation between requirements and code. Our results show mixed outcomes: while requirements smells had a small but significant effect when predicting whether a requirement was implemented in a piece of code (i.e., a trace link exists), no significant effect was observed when tracing the requirements with the associated lines of code. These findings suggest that requirements smells can affect LLM performance in certain SE tasks but may not uniformly impact all tasks. We highlight the need for further research to understand these nuances and propose future work toward developing guidelines for mitigating the negative effects of requirements smells in AI-driven SE processes.
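To make the studied setup concrete, the sketch below shows what an LLM-based trace-link query of this kind could look like. The requirement, code snippet, prompt wording, and the `query_llm` helper are all hypothetical illustrations, not the prompts or models used in the paper; the example requirement deliberately contains a vagueness smell ("quickly").

```python
# Hypothetical sketch of trace-link prediction with an LLM.
# Prompt, requirement, and query_llm are illustrative stand-ins only.

REQUIREMENT = "The system shall quickly display the search results to the user."  # vagueness smell: "quickly"

CODE_SNIPPET = '''
def search(query, index):
    results = index.lookup(query)
    return sorted(results, key=lambda r: r.score, reverse=True)[:10]
'''

PROMPT = f"""You are a software traceability assistant.
Requirement: {REQUIREMENT}
Code:
{CODE_SNIPPET}
Does this code implement the requirement? Answer 'yes' or 'no'."""

def query_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return "yes"

answer = query_llm(PROMPT)
print("Trace link predicted:", answer.strip().lower().startswith("yes"))
```

Under this framing, the paper's first task (does a trace link exist?) is a binary classification over such prompts, while the second task asks the model to point to the specific lines that realize the requirement.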
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
- Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting [23.61061000692023]
This study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions.
We propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with Large Language Models.
arXiv Detail & Related papers (2024-08-18T11:07:38Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
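As a rough illustration of the self-synthetic loop this entry describes, the sketch below has a student model generate its own input-output pairs for a task; every helper here is a hypothetical stub, not the paper's implementation.

```python
# Rough sketch of a SELF-GUIDE-style loop: the student LLM synthesizes
# task-specific input-output pairs that are later used to finetune it.
# All helpers are hypothetical stubs, not the paper's code.

def student_generate(prompt: str) -> str:
    """Stub standing in for a call to the student LLM."""
    return "stub output"

def synthesize_pairs(task_instruction: str, n: int) -> list[tuple[str, str]]:
    pairs = []
    for _ in range(n):
        # 1) Ask the student model to invent a plausible task input.
        new_input = student_generate(f"{task_instruction}\nWrite one new example input:")
        # 2) Ask it to answer that input, yielding a candidate output.
        new_output = student_generate(f"{task_instruction}\nInput: {new_input}\nOutput:")
        pairs.append((new_input, new_output))
    return pairs

def filter_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Minimal quality gate; the paper applies its own curation steps.
    return [(i, o) for i, o in pairs if i.strip() and o.strip()]

training_data = filter_pairs(synthesize_pairs("Classify the sentiment of a movie review.", 100))
# training_data would then feed a standard finetuning pipeline for the student.
```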
- What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks: sensitivity and consistency.
Sensitivity measures how predictions change across rephrasings of the prompt, while consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
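One plausible way to operationalize these two metrics is sketched below; the paper's formal definitions may differ in detail, so treat this as an illustrative reading (the `predictions` and `labels` toy data are invented): sensitivity as the rate of prediction flips across rephrasings of an item, consistency as the uniformity of predictions across rephrasings within a gold class.

```python
# Illustrative computation of sensitivity and consistency over prompt
# rephrasings; the paper's formal definitions may differ in detail.
from collections import Counter

# predictions[item][k] = model prediction for rephrasing k of the item's prompt
predictions = {
    "review_1": ["pos", "pos", "neg"],
    "review_2": ["neg", "neg", "neg"],
}
labels = {"review_1": "pos", "review_2": "neg"}  # gold class per item

def sensitivity(preds: dict[str, list[str]]) -> float:
    """Average rate at which predictions flip relative to the first rephrasing."""
    rates = []
    for ps in preds.values():
        flips = sum(p != ps[0] for p in ps[1:])
        rates.append(flips / (len(ps) - 1))
    return sum(rates) / len(rates)

def consistency(preds: dict[str, list[str]], gold: dict[str, str]) -> dict[str, float]:
    """Per class: how uniform predictions are across rephrasings of its items."""
    scores = {}
    for cls in set(gold.values()):
        flat = [p for item, ps in preds.items() if gold[item] == cls for p in ps]
        scores[cls] = Counter(flat).most_common(1)[0][1] / len(flat)
    return scores

print(sensitivity(predictions))          # 0.25 on the toy data above
print(consistency(predictions, labels))  # {'pos': 0.667, 'neg': 1.0} (approx.)
```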
- Guiding LLM Temporal Logic Generation with Explicit Separation of Data and Control [0.7580487359358722]
Temporal logics are powerful tools that are widely used for the synthesis and verification of reactive systems.
Recent progress on Large Language Models has the potential to make the process of writing such specifications more accessible.
arXiv Detail & Related papers (2024-06-11T16:07:24Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
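The sketch below illustrates the general recipe of fitting a linear probe on activation features to separate clean from drifted runs; the data is synthetic and the feature construction is an assumption, not the paper's pipeline.

```python
# Sketch of probing for task drift with a linear classifier over LLM
# activations, in the spirit of the entry above; all data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 256  # hypothetical hidden-state dimensionality

# Synthetic "activation deltas" (before vs. after processing external text):
# clean runs cluster near zero, drifted runs are shifted along some direction.
clean = rng.normal(0.0, 1.0, size=(500, d))
drift = rng.normal(0.0, 1.0, size=(500, d)) + rng.normal(2.0, 0.1, size=d)

X = np.vstack([clean, drift])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X[::2], y[::2])     # train on half
auc = roc_auc_score(y[1::2], probe.predict_proba(X[1::2])[:, 1])  # test on the rest
print(f"ROC AUC: {auc:.3f}")  # near 1.0 on this easily separable toy data
```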
- Feedback Loops With Language Models Drive In-Context Reward Hacking [78.9830398771605]
We show that feedback loops can cause in-context reward hacking (ICRH).
We identify and study two processes that lead to ICRH: output-refinement and policy-refinement.
As AI development accelerates, the effects of feedback loops will proliferate.
arXiv Detail & Related papers (2024-02-09T18:59:29Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes LLM-Augmenter, a system that augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
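A schematic of such an augment-verify-feedback loop is sketched below; all functions are illustrative stubs (the retrieval, generation, and utility checks stand in for the paper's plug-and-play modules, not their released code).

```python
# Schematic of an LLM-Augmenter-style loop: retrieve evidence, generate a
# grounded draft, verify it, and feed utility feedback back into the prompt.
# All functions are illustrative stubs, not the paper's released code.

def retrieve_evidence(question: str) -> str:
    return "Retrieved passage relevant to the question."  # stub knowledge lookup

def generate(question: str, evidence: str, feedback: str) -> str:
    return f"Answer grounded in: {evidence}"  # stub for the black-box LLM

def utility_score(answer: str, evidence: str) -> float:
    return 1.0 if evidence.split()[0] in answer else 0.0  # stub fact check

def answer_with_feedback(question: str, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        evidence = retrieve_evidence(question)
        draft = generate(question, evidence, feedback)
        if utility_score(draft, evidence) >= 1.0:
            return draft  # draft passed verification: stop iterating
        feedback = "Previous draft was not grounded; cite the evidence."
    return draft  # best effort after max_rounds

print(answer_with_feedback("Who wrote the paper?"))
```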
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.