PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
- URL: http://arxiv.org/abs/2504.14738v1
- Date: Sun, 20 Apr 2025 21:04:23 GMT
- Title: PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
- Authors: Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran
- Abstract summary: Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. In this paper, we introduce PROMPTEVALS, a dataset of 2087 pipeline prompts with 12623 corresponding assertion criteria.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
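To make the idea of an assertion or guardrail concrete, below is a minimal sketch of how a developer might run assertions alongside one LLM pipeline step. The specific criteria (output must be valid JSON, must not leak email addresses), the function names, and the `llm_call` parameter are hypothetical illustrations, not items taken from PROMPTEVALS, which stores assertion criteria as natural-language statements paired with pipeline prompts.

```python
import json
import re

def assert_valid_json(output: str) -> bool:
    """Assertion: the LLM output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def assert_no_email_addresses(output: str) -> bool:
    """Assertion: the output must not leak email addresses."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

# Hypothetical guardrail set for one pipeline prompt; PROMPTEVALS pairs each
# prompt with natural-language assertion criteria along these lines.
GUARDRAILS = [assert_valid_json, assert_no_email_addresses]

def run_with_guardrails(llm_call, prompt: str) -> str:
    """Run one pipeline step and flag outputs that violate any assertion."""
    output = llm_call(prompt)
    failures = [g.__name__ for g in GUARDRAILS if not g(output)]
    if failures:
        # In production, a developer might instead retry, repair, or log the failure.
        raise ValueError(f"Guardrail violations: {failures}")
    return output
```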
Related papers
- Large Language Models Can Self-Improve in Long-context Reasoning [100.52886241070907]
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning.
We propose an approach specifically designed for this purpose.
It achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
arXiv Detail & Related papers (2024-11-12T19:53:00Z) - DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing [10.712756715779822]
Large Language Models (LLMs) have shown promise in data processing.
Existing frameworks focus on reducing cost when executing user-specified operations, which is problematic for complex tasks and data.
We present DocETL, a system that optimizes complex document processing pipelines.
arXiv Detail & Related papers (2024-10-16T03:22:35Z) - Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation [6.463959200930805]
We evaluate new commercial and open models released since the open-source VerilogEval benchmark. We find measurable improvements in state-of-the-art models. We find that prompt engineering remains crucial for achieving good pass rates.
arXiv Detail & Related papers (2024-08-20T17:58:56Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines [15.389579061898429]
We present SPADE, a method for automatically synthesizing data quality assertions.
In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14%.
arXiv Detail & Related papers (2024-01-05T19:27:58Z) - FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)