Related papers: Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations

URL: http://arxiv.org/abs/2601.01954v1
Date: Mon, 05 Jan 2026 10:01:20 GMT
Title: Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations
Authors: Alexander Korn, Lea Zaruchas, Chetan Arora, Andreas Metzger, Sven Smolka, Fanyu Wang, Andreas Vogelsang
Abstract summary: Large Language Models are increasingly used to automate Software Engineering tasks. These models are guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner.
Score: 39.62249759297524
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top-3 SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-driven research. Based on the findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal significant misalignment between current practices and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward improving transparency, reproducibility, and methodological rigor in LLM-based SE research.

Related papers

TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability [4.517933493143603]
This paper introduces TraceLLM, a framework for enhancing requirements traceability through prompt engineering and demonstration selection. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets.
arXiv Detail & Related papers (2026-02-01T14:29:13Z)
Large Language Models (LLMs) for Requirements Engineering (RE): A Systematic Literature Review [2.0061679654181392]
The study categorizes the literature according to several dimensions, including publication trends, RE activities, prompting strategies, and evaluation methods. Most of the studies focus on using LLMs for requirements elicitation and validation, rather than defect detection and classification. Other artifacts are increasingly considered, including issues from issue tracking systems, regulations, and technical manuals.
arXiv Detail & Related papers (2025-09-14T21:45:01Z)
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks. We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents [30.603079363363634]
This study introduces ResearchArena, a benchmark designed to evaluate large language models' capabilities in conducting academic surveys. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers.
arXiv Detail & Related papers (2024-06-13T03:26:30Z)
Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models [14.405446719317291]
Existing debiasing techniques are typically training-based or require access to the model's internals and output distributions. We evaluate a comprehensive end-user-focused iterative framework of debiasing that applies System 2 thinking processes for prompts to induce logical, reflective, and critical text generation.
arXiv Detail & Related papers (2024-05-16T20:27:58Z)
Generative transformations and patterns in LLM-native approaches for software verification and falsification [1.4595796095047369]
We argue that a foundational step towards a more disciplined engineering practice is a systematic understanding of the core functional units: generative transformations. We first present a fine-grained taxonomy of generative transformations, abstracting prompt-based interactions into conceptual signatures. Our analysis not only validates the utility of the taxonomy but also surfaces strategic gaps and cross-dimensional relationships.
arXiv Detail & Related papers (2024-04-14T23:45:23Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs). We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
How Effective are Large Language Models in Generating Software Specifications? [14.170320751508502]
Large Language Models (LLMs) have been successfully applied to numerous Software Engineering (SE) tasks. We conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation.
arXiv Detail & Related papers (2023-06-06T00:28:39Z)
Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs). We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date. We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, author, and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.