Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis
- URL: http://arxiv.org/abs/2509.00038v1
- Date: Fri, 22 Aug 2025 21:37:49 GMT
- Title: Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis
- Authors: Teo Susnjak,
- Abstract summary: Large language models (LLMs) offer significant potential to accelerate systematic literature reviews.<n>Current approaches often rely on brittle, manually crafted prompts that compromise reliability and rigour.<n>This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR.
- Score: 1.624454100511275
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.
Related papers
- RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations.<n>We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings.<n>By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z) - Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions.<n>Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance.<n>This paper decomposes LLM applications into a three-layer architecture: textbftextitSystem Shell Layer, textbftextitPrompt Orchestration Layer, and textbftextitLLM Inference Core.
arXiv Detail & Related papers (2025-08-28T13:00:28Z) - AI-Driven Scholarly Peer Review via Persistent Workflow Prompting, Meta-Prompting, and Meta-Reasoning [0.0]
This report introduces Persistent Prompting (PWP), a potentially broadly applicable prompt engineering methodology.<n>We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts.<n>We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review.
arXiv Detail & Related papers (2025-05-06T09:06:18Z) - APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries [5.227446378450704]
APE-Bench I is the first realistic benchmark built from real-world commit histories of Mathlib4.<n>Eleanstic is a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib.
arXiv Detail & Related papers (2025-04-27T05:04:02Z) - LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE)<n>LLMs-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback-akin to textual gradients.<n>It consistently outperforms existing textual gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z) - Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting [21.04933334040135]
We introduce the Self-Prompting framework, a novel method designed to fully harness the embedded RE knowledge within Large Language Models.<n>Our framework employs a three-stage diversity approach to prompt LLMs, generating multiple synthetic samples that encapsulate specific relations from scratch.<n> Experimental evaluations on benchmark datasets show our approach outperforms existing LLM-based zero-shot RE methods.
arXiv Detail & Related papers (2024-10-02T01:12:54Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Automating Research Synthesis with Domain-Specific Large Language Model Fine-Tuning [0.9110413356918055]
This research pioneers the use of fine-tuned Large Language Models (LLMs) to automate Systematic Literature Reviews ( SLRs)
Our study employed the latest fine-tuning methodologies together with open-sourced LLMs, and demonstrated a practical and efficient approach to automating the final execution stages of an SLR process.
The results maintained high fidelity in factual accuracy in LLM responses, and were validated through the replication of an existing PRISMA-conforming SLR.
arXiv Detail & Related papers (2024-04-08T00:08:29Z) - Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal [49.24054920683246]
Large language models (LLMs) suffer from catastrophic forgetting during continual learning.
We propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal.
arXiv Detail & Related papers (2024-03-02T16:11:23Z) - Effective Large Language Model Adaptation for Improved Grounding and Citation Generation [48.07830615309543]
This paper focuses on improving large language models (LLMs) by grounding their responses in retrieved passages and by providing citations.
We propose a new framework, AGREE, that improves the grounding from a holistic perspective.
Our framework tunes LLMs to selfground the claims in their responses and provide accurate citations to retrieved documents.
arXiv Detail & Related papers (2023-11-16T03:22:25Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs)
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.