Related papers: Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

URL: http://arxiv.org/abs/2601.04666v1
Date: Thu, 08 Jan 2026 07:25:27 GMT
Title: Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning
Authors: Zhiyuan Chang, Mingyang Li, Yuekai Huang, Ziyou Jiang, Xiaojun Jia, Qian Xiong, Junjie Wang, Zhaoyang Li, Qing Wang,
Abstract summary: Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks.<n>We propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning.
Score: 31.790490397086856
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation

Related papers

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [92.26240528996443]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks.<n>We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction.<n>We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods.
arXiv Detail & Related papers (2025-07-18T06:23:31Z)
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities [76.9327488986162]
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images.<n>We exploit the capabilities of MLLMs to interpret non-textual instructions, specifically, adversarial images or audio generated by our novel method, Con Instruction.<n>Our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B)
arXiv Detail & Related papers (2025-05-31T13:11:14Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
Stepwise Reasoning Error Disruption Attack of LLMs [34.30455975290165]
Existing attacks on large language models (LLMs) are constrained by specific settings or lack of imperceptibility.<n>We propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers.
arXiv Detail & Related papers (2024-12-16T16:20:41Z)
What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers.<n>These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs.<n>Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.<n>One major cause of these vulnerabilities is the lack of an instruction hierarchy.<n>We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z)
SecAlign: Defending Against Prompt Injection with Preference Optimization [52.48001255555192]
Adversarial prompts can be injected into external data sources to override the system's intended instruction and execute a malicious instruction.<n>We propose a new defense called SecAlign based on the technique of preference optimization.<n>Our method reduces the success rates of various prompt injections to 10%, even against attacks much more sophisticated than ones seen during training.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions. We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.