Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs
Abstract Overview
Using the CWEval benchmark, this paper examines whether very small prompt perturbations can alter the security of code generated by coding LLMs. The authors test three open coding models across five programming languages with single-character, three-character, and token-level prompt mutations, and evaluate both functionality and joint functionality-plus-security. They find that even a single-character change can flip generated code from secure to vulnerable, with the impact varying across languages, CWE categories, token positions, and exact mutations. They also probe hidden states at the prompt’s last token and show that some vulnerability outcomes are already partially encoded before generation, although the signal is uneven across vulnerability types.
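To make the smallest perturbation concrete, here is a minimal sketch of a single-character prompt mutation. It assumes the mutation is a substitution drawn from a lowercase alphabet; the summary does not specify the paper's exact operators, and the function names `mutate_single_char` and `random_single_char_mutations` are hypothetical.

```python
import random


def mutate_single_char(prompt: str, position: int, replacement: str) -> str:
    """Return a copy of `prompt` with the character at `position` replaced."""
    if not 0 <= position < len(prompt):
        raise IndexError("position out of range")
    if len(replacement) != 1:
        raise ValueError("replacement must be exactly one character")
    return prompt[:position] + replacement + prompt[position + 1:]


def random_single_char_mutations(prompt, n, seed=0,
                                 alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Generate `n` (position, mutant) pairs, each differing by one character.

    Assumption: substitutions come from a lowercase alphabet; the actual
    study may use a different character set or other operators.
    """
    if not prompt:
        raise ValueError("prompt must be non-empty")
    rng = random.Random(seed)  # seeded so the mutants are reproducible
    mutants = []
    for _ in range(n):
        pos = rng.randrange(len(prompt))
        # Exclude the original character so every mutant really differs.
        choices = [c for c in alphabet if c != prompt[pos]]
        mutants.append((pos, mutate_single_char(prompt, pos, rng.choice(choices))))
    return mutants
```

Each mutant would then be sent to the model, and the generated code scored for functionality and security in the same way as the unmutated baseline.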
Novelty
The paper extends prior prompt-robustness work from functional correctness to code security, showing that ordinary prompt variation, and not only adversarial prompting, can introduce vulnerabilities. It also combines mutation analysis with hidden-state probing and a per-CWE breakdown, distinguishing between input-handling and secure-defaults vulnerability patterns.
Results
Across both breadth and intensity analyses, prompt mutations affected many CWE categories and more often harmed than improved outcomes; by effect size, Qwen3-Coder was more robust than CodeLlama and DeepSeek-Coder. Hidden-state probes for the joint functional-and-secure target achieved about 0.70 mean AUC overall, with higher predictability for input-handling vulnerabilities (mean AUC 0.753) than for secure-defaults vulnerabilities (mean AUC 0.674; one-sided Mann-Whitney p=0.009). The study also shows that some outcome flips are driven mainly by mutation position, while others depend more on the specific token change.
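For readers unfamiliar with the AUC figures above, the following sketch shows how an AUC can be computed from probe scores, and how the joint functional-and-secure label can be formed. It is an illustration under stated assumptions, not the paper's evaluation code: the helper names `joint_label` and `roc_auc` are hypothetical, and the probe is assumed to emit one real-valued score per prompt.

```python
def joint_label(functional: bool, secure: bool) -> bool:
    """Joint target: positive only if the code is both functional and secure."""
    return functional and secure


def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. The result equals the Mann-Whitney U
    statistic divided by len(scores_pos) * len(scores_neg).
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 corresponds to chance, so values around 0.70 indicate a real but imperfect signal. The one-sided Mann-Whitney test reported above is a separate step that compares per-CWE AUC values between the two vulnerability groups.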
Key Points
- Single-character prompt edits can change LLM-generated code from secure to vulnerable, even when the perturbation is minimal.
- Mutation effects are not uniform: they differ by model, language, CWE, token position, and whether the exact substitution alters a security-critical part of the prompt.
- Prompt-end hidden states contain a usable but uneven security signal, with stronger predictability for vulnerabilities that require adding validation or sanitization than for those determined by a local secure-default choice.
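
The flip analysis behind these points can be sketched as follows. This is an assumption-laden illustration: the record layout `(position, baseline_ok, mutated_ok)` and the function names are hypothetical, with `baseline_ok` and `mutated_ok` standing for whether the generated code was functional and secure before and after mutation.

```python
from collections import Counter


def classify_flip(baseline_ok: bool, mutated_ok: bool) -> str:
    """Label one mutation outcome relative to the unmutated baseline."""
    if baseline_ok and not mutated_ok:
        return "harmed"
    if not baseline_ok and mutated_ok:
        return "improved"
    return "unchanged"


def flip_rate_by_position(records):
    """Return {position: fraction of mutations at that position that flipped}.

    `records` is an iterable of (position, baseline_ok, mutated_ok) tuples.
    """
    totals = Counter()
    flips = Counter()
    for position, baseline_ok, mutated_ok in records:
        totals[position] += 1
        if classify_flip(baseline_ok, mutated_ok) != "unchanged":
            flips[position] += 1
    return {pos: flips[pos] / totals[pos] for pos in totals}
```

If the flip rate is high at a position regardless of which character is substituted, the flip is plausibly position-driven; if it varies strongly across substitutions at the same position, the specific change matters more.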