Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
- URL: http://arxiv.org/abs/2602.11786v1
- Date: Thu, 12 Feb 2026 10:09:13 GMT
- Title: Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
- Authors: Keita Broadwater
- Abstract summary: We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions to surface latent failure modes. We find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In high-stakes settings, response consistency and safety under sustained use are critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (e.g., decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of independent inference events. We formalize safety failures using Bernoulli and binomial models to estimate per-inference failure probabilities, enabling quantitative comparison of reliability across models and decoding configurations. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH-derived safety prompts, we find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling, particularly as temperature increases. These results demonstrate that shallow, single-sample evaluation can obscure meaningful reliability differences under sustained use. APST complements existing benchmarks by providing a practical framework for evaluating LLM safety and reliability under repeated inference, bridging benchmark alignment and deployment-oriented risk assessment.
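The Bernoulli/binomial framing above is simple to reproduce. Below is a minimal sketch, assuming a hypothetical `generate` inference call and a hypothetical `is_failure` safety judge (the paper does not specify an API), of how a per-inference failure probability might be estimated from repeated sampling of one prompt; the Wilson score interval is one reasonable choice for quantifying uncertainty, not necessarily the paper's.

```python
import math
from typing import Callable

def estimate_failure_rate(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> completion
    is_failure: Callable[[str], bool],      # hypothetical: flags hallucination/refusal/unsafe output
    n_samples: int = 200,
    temperature: float = 1.0,
    z: float = 1.96,                        # 95% confidence, normal quantile
) -> tuple[float, tuple[float, float]]:
    """Treat each inference as an i.i.d. Bernoulli trial and estimate the
    per-inference failure probability p_hat = k / n with a Wilson interval."""
    k = sum(is_failure(generate(prompt, temperature)) for _ in range(n_samples))
    p_hat = k / n_samples

    # Wilson score interval for a binomial proportion (stable when k is small).
    denom = 1 + z**2 / n_samples
    center = (p_hat + z**2 / (2 * n_samples)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_samples + z**2 / (4 * n_samples**2)
    )
    return p_hat, (center - half, center + half)
```

Sweeping `temperature` and comparing the resulting estimates and intervals across models mirrors, in spirit, the comparison the abstract reports for AIR-BENCH-derived prompts.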
Related papers
- CausalCompass: Evaluating the Robustness of Time-Series Causal Discovery in Misspecified Scenarios [17.11442807888366]
CausalCompass is a benchmark suite designed to assess the robustness of time-series causal discovery (TSCD) methods under violations of modeling assumptions. We conduct extensive benchmarking of representative TSCD algorithms across eight assumption-violation scenarios. The methods exhibiting superior overall performance across diverse scenarios are almost all deep learning-based approaches.
arXiv Detail & Related papers (2026-02-08T11:27:06Z) - NegBLEURT Forest: Leveraging Inconsistencies for Detecting Jailbreak Attacks [8.416892421891761]
Jailbreak attacks designed to bypass safety mechanisms pose a serious threat by prompting LLMs to generate harmful or inappropriate content despite alignment with ethical guidelines. This work introduces a semantic consistency analysis between successful and unsuccessful responses, demonstrating that a negation-aware scoring approach captures meaningful patterns. A novel detection framework called NegBLEURT Forest is proposed to evaluate the degree of alignment between outputs elicited by adversarial prompts and expected safe behaviors. It identifies anomalous responses using the Isolation Forest algorithm, enabling reliable jailbreak detection (see the sketch after this list).
arXiv Detail & Related papers (2025-11-14T14:43:54Z) - Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs [10.896368527058714]
Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We introduce two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB). Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios.
arXiv Detail & Related papers (2025-10-09T12:38:16Z) - Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation [68.106428321492]
Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. We present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes uncertainty into epistemic and intrinsic components, each estimated separately.
arXiv Detail & Related papers (2025-10-09T10:26:58Z) - FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z) - SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions [54.34001921326444]
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet existing benchmarks evaluate only subsets of real-world operating conditions, missing others entirely. We introduce SVeritas, a comprehensive speaker verification benchmark suite that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks.
arXiv Detail & Related papers (2025-09-21T14:11:16Z) - ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly on complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either perform post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence and Early Stopping to terminate reasoning once confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z) - On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles [4.342427756164555]
This paper argues that a rigorous statistical foundation is essential to address these challenges and enable sound safety assurance. By drawing parallels between AV testing and established software testing methods, we identify shared research gaps and reusable solutions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other.
arXiv Detail & Related papers (2025-05-04T22:06:23Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably, which poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Adversarial Robustness Overestimation and Instability in TRADES [4.063518154926961]
TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task.
This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking.
arXiv Detail & Related papers (2024-10-10T07:32:40Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z) - Probabilities Are Not Enough: Formal Controller Synthesis for Stochastic Dynamical Models with Epistemic Uncertainty [68.00748155945047]
Capturing uncertainty in models of complex dynamical systems is crucial to designing safe controllers.
Several approaches use formal abstractions to synthesize policies that satisfy temporal specifications related to safety and reachability.
Our contribution is a novel abstraction-based controller method for continuous-state models with noise, uncertain parameters, and external disturbances.
arXiv Detail & Related papers (2022-10-12T07:57:03Z)
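The NegBLEURT Forest entry above names the Isolation Forest algorithm as its anomaly detector. Here is a minimal sketch of that step, assuming hypothetical per-response consistency scores as the feature representation (the entry does not define one) and using scikit-learn's standard IsolationForest, which may differ from the paper's exact setup:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical negation-aware consistency scores, one per model response;
# lower values indicate weaker alignment with expected safe behavior.
scores = np.array([[0.91], [0.88], [0.93], [0.12], [0.90], [0.86]])

# Fit an Isolation Forest and flag outlying responses as suspected jailbreaks.
forest = IsolationForest(contamination=0.2, random_state=0).fit(scores)
labels = forest.predict(scores)   # -1 = anomalous, 1 = normal
suspected = np.where(labels == -1)[0]
print(suspected)                  # indices of flagged responses, e.g. [3]
```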