Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations
- URL: http://arxiv.org/abs/2509.13680v1
- Date: Wed, 17 Sep 2025 04:17:42 GMT
- Title: Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations
- Authors: Wei Ma, Yixiao Yang, Jingquan Ge, Xiaofei Xie, Lingxiao Jiang
- Abstract summary: We present PromptSE, a framework that creates semantically equivalent prompt variants with emotion and personality templates. Our study shows that performance and stability behave as largely decoupled optimization objectives. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection.
- Score: 40.12950482269347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code generation models are widely used in software development, yet their sensitivity to prompt phrasing remains under-examined. Identical requirements expressed with different emotions or communication styles can yield divergent outputs, while most benchmarks emphasize only peak performance. We present PromptSE (Prompt Sensitivity Evaluation), a framework that creates semantically equivalent prompt variants with emotion and personality templates, and that evaluates stability using probability-aware continuous scoring, or binary pass rates when logits are unavailable. The results are aggregated into a proposed area-under-curve metric (AUC-E) for cross-model comparison. Across 14 models from three families (Llama, Qwen, and DeepSeek), our study shows that performance and stability behave as largely decoupled optimization objectives, and it reveals architectural and scale-related patterns that challenge common assumptions about model robustness. The framework supports rapid screening for closed-source models as well as detailed stability analysis in research settings. PromptSE enables practitioners to quantify performance-stability trade-offs for deployment and model selection, positioning prompt stability as a complementary evaluation dimension alongside performance and fairness, and contributing to more trustworthy AI-assisted software development tools.
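The abstract's evaluation loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the emotion/personality templates, the `score_fn` interface, and the exact AUC-E aggregation below (sorted scores integrated over the variant fraction) are all assumptions.

```python
import numpy as np

def evaluate_stability(score_fn, base_prompt, templates):
    """Score a model on semantically equivalent variants of one prompt.

    score_fn(prompt) -> float in [0, 1]: a binary pass rate, or a
    probability-aware continuous score when logits are available.
    """
    variants = [t.format(task=base_prompt) for t in templates]
    return np.array([score_fn(v) for v in variants])

def auc_e(scores):
    """Illustrative AUC-E aggregate: sort per-variant scores in
    descending order and integrate (trapezoid rule) over the variant
    fraction, so a model that stays strong across all variants
    scores near its peak, while a brittle one is penalized."""
    s = np.sort(scores)[::-1]
    x = np.linspace(0.0, 1.0, len(s))
    return float(np.sum((s[1:] + s[:-1]) / 2.0 * np.diff(x)))

# Toy usage with hypothetical emotion/personality templates.
templates = [
    "{task}",
    "I'm really stressed, please help quickly: {task}",
    "As a meticulous senior engineer, solve: {task}",
]
flaky_model = lambda p: 0.9 if p.startswith("As a") else 0.4
print(auc_e(evaluate_stability(flaky_model, "reverse a list", templates)))
```

A perfectly stable model here would score its flat pass rate regardless of phrasing, which is what makes the metric usable for cross-model comparison alongside peak performance.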
Related papers
- Can Causality Cure Confusion Caused By Correlation (in Software Analytics)? [4.082216579462797]
Symbolic models, particularly decision trees, are widely used in software engineering for explainable analytics. Recent studies in software engineering show that both correlational models and causal discovery algorithms suffer from pronounced instability. This study investigates integrating causality-aware split criteria into symbolic models to improve their stability and robustness.
arXiv Detail & Related papers (2026-02-17T23:35:50Z) - AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition [72.24180896265192]
We introduce AgentNoiseBench, a framework for evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios. We then categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks.
arXiv Detail & Related papers (2026-02-11T20:33:10Z) - Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the signal-to-noise ratio of policy updates. SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
arXiv Detail & Related papers (2026-02-01T12:56:10Z) - Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z) - Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z) - Stability Under Scrutiny: Benchmarking Representation Paradigms for Online HD Mapping [25.516502412129096]
This paper presents the first comprehensive benchmark for evaluating the temporal stability of online HD mapping models. We propose a multi-dimensional stability evaluation framework with novel metrics for Presence, Localization, and Shape Stability. Our work highlights the importance of treating temporal stability as a core evaluation criterion alongside accuracy, advancing the development of more reliable autonomous driving systems.
arXiv Detail & Related papers (2025-10-12T15:33:45Z) - SALMAN: Stability Analysis of Language Models Through the Maps Between Graph-based Manifolds [11.373585987937913]
We propose SALMAN, a unified, local (sample-level) robustness framework that evaluates model stability without modifying internal parameters or resorting to complex perturbations. Central to our approach is a novel Distance Mapping Distortion (DMD) measure, which ranks each sample's susceptibility by comparing input-to-output distance mappings in a near-linear manner. By demonstrating significant gains in attack efficiency and robust training, we position our framework as a practical, model-agnostic tool for advancing the reliability of transformer-based NLP systems.
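The distance-mapping idea behind DMD can be illustrated with a small sketch. The exact definition (which distances are used and how they are normalized and aggregated) is our assumption for illustration, not taken from the paper:

```python
import numpy as np

def dmd_scores(X_in, X_out):
    """Hypothetical Distance Mapping Distortion (DMD): for each sample,
    compare pairwise distances in the input representation with the
    corresponding distances in the output representation; a large
    output/input expansion flags a locally unstable sample."""
    def pdist(M):
        # Full pairwise Euclidean distance matrix via broadcasting.
        return np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
    d_in, d_out = pdist(X_in), pdist(X_out)
    np.fill_diagonal(d_in, np.inf)      # self-pairs become 0/inf = 0 below
    return (d_out / d_in).max(axis=1)   # worst-case distortion per sample

# A mapping that doubles all distances distorts every sample by a factor of 2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(dmd_scores(X, 2 * X))
```

Ranking samples by this score gives a perturbation-free susceptibility ordering: the samples whose neighborhoods the model stretches the most are the ones most likely to flip under small input changes.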
arXiv Detail & Related papers (2025-08-23T02:50:55Z) - Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision-making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use reinforcement learning, which suffers from training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes them myopic. We introduce a latent variable model, trained in a supervised manner; acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [78.18946529195254]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z) - Statistical Runtime Verification for LLMs via Robustness Estimation [0.0]
Adversarial robustness verification is essential for ensuring the safe deployment of Large Language Models (LLMs) in runtime-critical applications. This paper presents a case study adapting and extending the RoMA statistical verification framework to assess its feasibility as an online runtime robustness monitor for LLMs in black-box deployment settings.
arXiv Detail & Related papers (2025-04-24T16:36:19Z) - Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models [13.216398753024182]
Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved impressive performance across a wide range of tasks. In this study, we seek to pinpoint the sources of their fragility by identifying parameters and input dimensions that are susceptible to such perturbations. We propose a stability measure called FI (First-order local Influence), which is rooted in information geometry and quantifies the sensitivity of individual parameter and input dimensions.
arXiv Detail & Related papers (2025-03-28T16:23:59Z) - On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z) - Model Stability with Continuous Data Updates [2.439909645714735]
We study the "stability" of machine learning (ML) models within the context of larger, complex NLP systems.
We find that model design choices, including network architecture and input representation, have a critical impact on stability.
We recommend ML model designers account for trade-offs in accuracy and jitter when making modeling choices.
arXiv Detail & Related papers (2022-01-14T22:11:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.