TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability
- URL: http://arxiv.org/abs/2602.01253v1
- Date: Sun, 01 Feb 2026 14:29:13 GMT
- Title: TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability
- Authors: Nouf Alturayeif, Irfan Ahmad, Jameleddine Hassine,
- Abstract summary: This paper introduces TraceLLM, a framework for enhancing requirements traceability through prompt engineering and demonstration selection.<n>We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets.
- Score: 4.517933493143603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Requirements traceability, the process of establishing and maintaining relationships between requirements and various software development artifacts, is paramount for ensuring system integrity and fulfilling requirements throughout the Software Development Life Cycle (SDLC). Traditional methods, including manual and information retrieval models, are labor-intensive, error-prone, and limited by low precision. Recently, Large Language Models (LLMs) have demonstrated potential for supporting software engineering tasks through advanced language comprehension. However, a substantial gap exists in the systematic design and evaluation of prompts tailored to extract accurate trace links. This paper introduces TraceLLM, a systematic framework for enhancing requirements traceability through prompt engineering and demonstration selection. Our approach incorporates rigorous dataset splitting, iterative prompt refinement, enrichment with contextual roles and domain knowledge, and evaluation across zero- and few-shot settings. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets representing diverse domains (aerospace, healthcare) and artifact types (requirements, design elements, test cases, regulations). TraceLLM achieves state-of-the-art F2 scores, outperforming traditional IR baselines, fine-tuned models, and prior LLM-based methods. We also explore the impact of demonstration selection strategies, identifying label-aware, diversity-based sampling as particularly effective. Overall, our findings highlight that traceability performance depends not only on model capacity but also critically on the quality of prompt engineering. In addition, the achieved performance suggests that TraceLLM can support semi-automated traceability workflows in which candidate links are reviewed and validated by human analysts.
Related papers
- Towards Agentic Intelligence for Materials Science [73.4576385477731]
This survey advances a unique pipeline-centric view that spans from corpus curation and pretraining to goal-conditioned agents interfacing with simulation and experimental platforms.<n>To bridge communities and establish a shared frame of reference, we first present an integrated lens that aligns terminology, evaluation, and workflow stages across AI and materials science.
arXiv Detail & Related papers (2026-01-29T23:48:43Z) - Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering [19.584762693453893]
BEHELM is a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation.<n>Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
arXiv Detail & Related papers (2026-01-28T21:55:10Z) - Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - Data-Driven Methods and AI in Engineering Design: A Systematic Literature Review Focusing on Challenges and Opportunities [0.2545763876632975]
Machine learning and statistical methods dominate current practice, whereas deep learning exhibits a clear upward trend in adoption.<n>Key challenges in existing applications include limited model interpretability, poor cross-stage traceability, and insufficient validation under real-world conditions.<n>This review is a first step toward design-stage guidelines; a follow-up synthesis should map computer science algorithms to engineering design problems and activities.
arXiv Detail & Related papers (2025-11-25T11:16:38Z) - A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Models-powered software engineering.<n>We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z) - LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology [3.470217255779291]
We introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis.<n>Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries.<n> Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful agent responses.
arXiv Detail & Related papers (2025-09-17T13:51:29Z) - Evaluating Large Language Models on Non-Code Software Engineering Tasks [4.381476817430934]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation.<n>We present the first comprehensive benchmark, which we name Software Engineering Language Understanding' (SELU)<n>SELU covers classification, regression, Named Entity Recognition (NER) and Masked Language Modeling (MLM) targets, with data drawn from diverse sources.
arXiv Detail & Related papers (2025-06-12T15:52:32Z) - GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics [9.549568621873386]
GateLens is an LLM-based system for analyzing data in the automotive domain.<n>Unlike traditional multi-agent or planning-based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability.
arXiv Detail & Related papers (2025-03-27T17:48:32Z) - Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework [81.29965270493238]
We develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) for wireless communication applications.<n>The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard.<n>We introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data.
arXiv Detail & Related papers (2025-01-16T16:19:53Z) - Large Language Models as Realistic Microservice Trace Generators [48.730974361862366]
This paper proposes a first-of-a-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces.<n>We show that TraceLLM produces diverse, realistic traces under varied conditions, outperforming existing approaches in both accuracy and validity.<n>TraceLLM adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
arXiv Detail & Related papers (2024-12-16T12:48:04Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.