Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems
- URL: http://arxiv.org/abs/2505.13546v1
- Date: Mon, 19 May 2025 03:28:33 GMT
- Title: Prompt Stability Matters: Evaluating and Optimizing Auto-Generated Prompt in General-Purpose Systems
- Authors: Ke Chen, Yufei Zhou, Xitong Zhang, Haohan Wang
- Abstract summary: We introduce semantic stability as a criterion for assessing the consistency of model responses. We develop the first stability-aware general-purpose prompt generation system. Our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
- Score: 19.59294293070619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic prompt generation plays a crucial role in enabling general-purpose multi-agent systems to perform diverse tasks autonomously. Existing methods typically evaluate prompts based on their immediate task performance, overlooking the intrinsic qualities that determine their reliability. This outcome-centric view not only limits interpretability but also fails to account for the inherent stochasticity of large language models (LLMs). In this work, we bring attention to prompt stability, the consistency of model responses across repeated executions, as a key factor for building robust and effective prompt generation systems. To quantify this, we propose semantic stability as a criterion for assessing the response consistency of prompts, and fine-tune a LLaMA-based evaluator to measure it automatically across tasks. These components have enabled us to develop the first stability-aware general-purpose prompt generation system that leverages stability feedback to iteratively enhance both prompt quality and system-level performance. Furthermore, we establish a logical chain between prompt stability and task success by analyzing the structural dependencies within our system, proving stability as a necessary condition for effective system-level execution. Empirical results across general and domain-specific tasks demonstrate that our stability-aware framework improves both accuracy and output consistency. By shifting the focus from one-off results to persistent reliability, our work offers a new perspective on prompt design and contributes practical tools for building more trustworthy general-purpose systems.
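A minimal sketch of the idea of prompt stability described in the abstract: run the same prompt several times and score how consistent the responses are. The abstract says the authors measure semantic stability with a fine-tuned LLaMA-based evaluator; the sketch below does not reproduce that. It swaps in a simple bag-of-words cosine similarity as a stand-in, and the names `bow_cosine` and `stability_score` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings.

    Assumption: a crude lexical proxy for semantic similarity, not the
    learned evaluator described in the abstract.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def stability_score(responses: list[str]) -> float:
    """Mean pairwise similarity over repeated responses to one prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        # A single response has nothing to disagree with.
        return 1.0
    return sum(bow_cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical repeated executions of two different prompts.
consistent = ["the answer is 42", "the answer is 42", "answer is 42"]
inconsistent = ["the answer is 42", "it depends on context", "cannot say"]

print(stability_score(consistent))
print(stability_score(inconsistent))
```

Under this proxy, a prompt whose repeated outputs share most of their wording scores close to 1, and a prompt whose outputs diverge scores close to 0. A stability-aware generation loop could, in principle, use such a score as feedback when revising a prompt, though the actual feedback mechanism is not specified in this listing.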
Related papers
- Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration [48.19579266939883]
Diffusion large language models (dLLMs) have attracted significant attention for their ability to enhance diversity, controllability, and parallelism. We propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs.
arXiv Detail & Related papers (2026-03-03T08:58:20Z)
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning [75.73135757250806]
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting.
arXiv Detail & Related papers (2026-02-25T03:43:34Z)
- Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading. We find that alignment is strong for binary tasks but degrades with increased rubric granularity. Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
- OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment [55.59322229889159]
We propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals. We use a reasoning-enhanced reward modeling dataset to form a reliable chain-of-thought dataset for supervised fine-tuning. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
arXiv Detail & Related papers (2025-10-12T13:46:28Z)
- Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations [40.12950482269347]
We present PromptSE, a framework that creates semantically equivalent prompt variants with emotion and personality templates. Our study shows that performance and stability behave as largely decoupled optimization objectives. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection.
arXiv Detail & Related papers (2025-09-17T04:17:42Z)
- An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity [104.05991573442805]
Vision-Language Models (VLMs) have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts.
arXiv Detail & Related papers (2025-09-16T06:11:02Z)
- Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- Re-evaluation of Logical Specification in Behavioural Verification [0.0]
This study empirically validates automated logical specification methods for behavioural models. We identify performance irregularities that suggest the need for adaptive performance irregularities in automated reasoning. Addressing these inefficiencies through self-optimising solvers could enhance the stability of automated reasoning.
arXiv Detail & Related papers (2025-05-23T14:46:39Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Probabilistic Stability Guarantees for Feature Attributions [20.58023369482214]
We propose a model-agnostic, sample-efficient stability certification algorithm (SCA) that yields non-trivial and interpretable guarantees for attribution methods. We show that mild smoothing achieves a more favorable trade-off between accuracy and stability, avoiding the aggressive compromises made in prior certification methods.
arXiv Detail & Related papers (2025-04-18T16:39:08Z)
- Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions [8.069858557211132]
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions.
arXiv Detail & Related papers (2025-03-28T11:49:56Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) frequently hallucinate due to misaligned self-awareness. Existing approaches mitigate hallucinations via uncertainty estimation or query rejection. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Trustworthiness for an Ultra-Wideband Localization Service [2.4979362117484714]
This paper proposes a holistic trustworthiness assessment framework for ultra-wideband self-localization.
Our goal is to provide guidance for evaluating a system's trustworthiness based on objective evidence.
Our approach guarantees that the resulting trustworthiness indicators correspond to chosen real-world threats.
arXiv Detail & Related papers (2024-08-10T11:57:10Z)
- Stability-Certified Learning of Control Systems with Quadratic Nonlinearities [9.599029891108229]
This work primarily focuses on an operator inference methodology aimed at constructing low-dimensional dynamical models.
Our main objective is to develop a method that facilitates the inference of quadratic control dynamical systems with inherent stability guarantees.
arXiv Detail & Related papers (2024-03-01T16:26:47Z)
- Algorithmic Robustness [18.406992961818368]
Robustness is an important enabler of other goals that are frequently cited in the context of public policy decisions about computational systems.
This document provides a brief roadmap to some of the concepts and existing research around the idea of algorithmic robustness.
arXiv Detail & Related papers (2023-10-17T17:51:12Z)
- ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.