Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
- URL: http://arxiv.org/abs/2510.14381v1
- Date: Thu, 16 Oct 2025 07:28:54 GMT
- Title: Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
- Authors: Andrew Zhao, Reshmi Ghosh, Vitor Carvalho, Emily Lawton, Keegan Hines, Gao Huang, Jack W. Stokes,
- Abstract summary: We present the first systematic analysis of poisoning risks in LLM-based prompt optimization.<n>We find systems are substantially more vulnerable to manipulated feedback than to injected queries.<n>We propose a lightweight highlighting defense that reduces the fake-reward $Delta$ASR from 0.23 to 0.07 without degrading utility.
- Score: 21.207996237794855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
Related papers
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization [0.0]
This paper introduces a novel framework for hardening system prompts through shield appending.<n>We leverage an LLM-as-optimizer to search the space of possible SHIELDs, seeking to minimize a leakage metric derived from a suite of adversarial attacks.<n>We demonstrate empirically that our optimized SHIELDs significantly reduce prompt leakage against a comprehensive set of extraction attacks.
arXiv Detail & Related papers (2025-11-20T10:25:45Z) - AEGIS : Automated Co-Evolutionary Framework for Guarding Prompt Injections Schema [39.44407870355891]
We propose AEGIS, an automated co-Evolutionary framework for Guarding prompt Injections.<n>Both attack and defense prompts are iteratively optimized against each other using a gradient-like natural language prompt optimization technique.<n>We evaluate our system on a real-world assignment grading dataset of prompt injection attacks and demonstrate that our method consistently outperforms existing baselines.
arXiv Detail & Related papers (2025-08-27T12:25:45Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [65.18157595903124]
This work investigates iterative approximate evaluation for arbitrary prompts.<n>It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.<n>MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems [1.03590082373586]
In prompt injection attacks, an attacker maliciously manipulates the system instructions, violating the system's confidentiality.<n>We investigated the capabilities of early prompt injection detection systems, focusing specifically on the detection performance of techniques implemented in various open-source solutions.<n>Our study presents analyzes of distinct prompt leakage detection techniques, and a comparative analysis of several detection solutions, which implement those techniques.
arXiv Detail & Related papers (2025-06-23T20:39:43Z) - Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts [28.043964124611026]
We take a novel approach to safeguard large language models (LLMs) by learning to adapt the system prompts in instruction-tuned LLMs.<n>We propose $textbfSysformer$, a trans$textbfformer$ model that updates an initial $textbfsys$tem prompt to a more robust system prompt in the LLM input embedding space.<n>We show that Sysformer can significantly enhance the robustness of LLMs, leading to upto $80%$ gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by upto $90%$
arXiv Detail & Related papers (2025-06-18T05:48:05Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.<n>We show that scaling agentic reasoning system at test-time substantially enhances robustness without compromising model utility.<n> Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - Joint Optimization of Prompt Security and System Performance in Edge-Cloud LLM Systems [15.058369477125893]
Large language models (LLMs) have significantly facilitated human life, and prompt engineering has improved the efficiency of these models.<n>Recent years have witnessed a rise in prompt engineering-empowered attacks, leading to issues such as privacy leaks, increased latency, and system resource wastage.<n>We jointly consider prompt security, service latency, and system resource optimization in Edge-Cloud LLM (EC-LLM) systems under various prompt attacks.
arXiv Detail & Related papers (2025-01-30T14:33:49Z) - SecAlign: Defending Against Prompt Injection with Preference Optimization [52.48001255555192]
Adversarial prompts can be injected into external data sources to override the system's intended instruction and execute a malicious instruction.<n>We propose a new defense called SecAlign based on the technique of preference optimization.<n>Our method reduces the success rates of various prompt injections to 10%, even against attacks much more sophisticated than ones seen during training.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content.<n>We present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds.
arXiv Detail & Related papers (2024-04-21T22:18:13Z) - Optimization-based Prompt Injection Attack to LLM-as-a-Judge [69.27584941296875]
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question.<n>We propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge.<n>Our evaluation shows that JudgeDeceive is highly effective, and is much more effective than existing prompt injection attacks.
arXiv Detail & Related papers (2024-03-26T13:58:00Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.