Related papers: BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models

BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models

URL: http://arxiv.org/abs/2602.05401v1
Date: Thu, 05 Feb 2026 07:34:35 GMT
Title: BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models
Authors: Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Guowen Xu,
Abstract summary: Chat template is a technique used in the training and inference stages of Large Language Models.<n>BadTemplate is a training-free backdoor attack that inserts malicious instructions directly into the system prompt.<n>It achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks.
Score: 28.29280600867087
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Chat template is a common technique used in the training and inference stages of Large Language Models (LLMs). It can transform input and output data into role-based and templated expressions to enhance the performance of LLMs. However, this also creates a breeding ground for novel attack surfaces. In this paper, we first reveal that the customizability of chat templates allows an attacker who controls the template to inject arbitrary strings into the system prompt without the user's notice. Building on this, we propose a training-free backdoor attack, termed BadTemplate. Specifically, BadTemplate inserts carefully crafted malicious instructions into the high-priority system prompt, thereby causing the target LLM to exhibit persistent backdoor behaviors. BadTemplate outperforms traditional backdoor attacks by embedding malicious instructions directly into the system prompt, eliminating the need for model retraining while achieving high attack effectiveness with minimal cost. Furthermore, its simplicity and scalability make it easily and widely deployed in real-world systems, raising serious risks of rapid propagation, economic damage, and large-scale misinformation. Furthermore, detection by major third-party platforms HuggingFace and LLM-as-a-judge proves largely ineffective against BadTemplate. Extensive experiments conducted on 5 benchmark datasets across 6 open-source and 3 closed-source LLMs, compared with 3 baselines, demonstrate that BadTemplate achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks. Our work highlights the potential security risks raised by chat templates in the LLM supply chain, thereby supporting the development of effective defense mechanisms.

Related papers

Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates [3.823638706744939]
Chat templates are executable Jinja2 programs invoked at every inference call.<n>We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor.<n>Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform.
arXiv Detail & Related papers (2026-02-04T15:28:53Z)
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.<n> backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.<n>We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models [79.36881186707413]
Multi-modal large language models (MLLMs) process multi-modal information, enabling them to generate responses to image-text inputs.<n> MLLMs have been incorporated into diverse multi-modal applications, such as autonomous driving and medical diagnosis, via plug-and-play without fine-tuning.<n>We propose BadToken, the first token-level backdoor attack to MLLMs.
arXiv Detail & Related papers (2025-03-20T10:39:51Z)
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks.<n>$textitELBA-Bench$ allows attackers to inject backdoor through parameter efficient fine-tuning.<n>$textitELBA-Bench$ provides over 1300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z)
ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs [17.853862145962292]
We introduce a novel backdoor attack that systematically bypasses system prompts. Our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%.
arXiv Detail & Related papers (2024-10-05T02:58:20Z)
Large Language Models are Good Attackers: Efficient and Stealthy Textual Backdoor Attacks [10.26810397377592]
We introduce the Efficient and Stealthy Textual backdoor attack method, EST-Bad, leveraging Large Language Models (LLMs) Our EST-Bad encompasses three core strategies: optimizing the inherent flaw of models as the trigger, stealthily injecting triggers with LLMs, and meticulously selecting the most impactful samples for backdoor injection.
arXiv Detail & Related papers (2024-08-21T12:50:23Z)
MEGen: Generative Backdoor into Large Language Models via Model Editing [36.67048791892558]
This paper focuses on the impact of backdoored large language models (LLMs)<n>We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks.<n>Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local parameters.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models [69.37990698561299]
TrojFM is a novel backdoor attack tailored for very large foundation models. Our approach injects backdoors by fine-tuning only a very small proportion of model parameters. We demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models.
arXiv Detail & Related papers (2024-05-27T03:10:57Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models [15.381273199132433]
BadChain is the first backdoor attack against large language models (LLMs) employing chain-of-thought (COT) prompting. We show the effectiveness of BadChain for two COT strategies and six benchmark tasks. BadChain remains a severe threat to LLMs, underscoring the urgency for the development of robust and effective future defenses.
arXiv Detail & Related papers (2024-01-20T04:53:35Z)
TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4 [15.015584291919817]
We propose a novel approach of TARGET (Template-trAnsfeRable backdoor attack aGainst prompt-basEd NLP models via GPT4) Specifically, we first utilize GPT4 to reformulate manual templates to generate tone-strong and normal templates, and the former are injected into the model as a backdoor trigger in the pre-training phase. Then, we not only directly employ the above templates in the downstream task, but also use GPT4 to generate templates with similar tone to the above templates to carry out transferable attacks.
arXiv Detail & Related papers (2023-11-29T08:12:09Z)
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks. We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.