Related papers: Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

URL: http://arxiv.org/abs/2602.04653v2
Date: Thu, 05 Feb 2026 11:59:44 GMT
Title: Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
Authors: Ariel Fogel, Omer Hofman, Eilon Cohen, Roman Vainshtein,
Abstract summary: Chat templates are executable Jinja2 programs invoked at every inference call.<n>We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor.<n>Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform.
Score: 3.823638706744939
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-weight language models are increasingly used in production settings, raising new security challenges. One prominent threat in this context is backdoor attacks, in which adversaries embed hidden behaviors in language models that activate under specific conditions. Previous work has assumed that adversaries have access to training pipelines or deployment infrastructure. We propose a novel attack surface requiring neither, which utilizes the chat template. Chat templates are executable Jinja2 programs invoked at every inference call, occupying a privileged position between user input and model processing. We show that an adversary who distributes a model with a maliciously modified template can implant an inference-time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. We evaluated this attack vector by constructing template backdoors targeting two objectives: degrading factual accuracy and inducing emission of attacker-controlled URLs, and applied them across eighteen models spanning seven families and four inference engines. Under triggered conditions, factual accuracy drops from 90% to 15% on average while attacker-controlled URLs are emitted with success rates exceeding 80%; benign inputs show no measurable degradation. Backdoors generalize across inference runtimes and evade all automated security scans applied by the largest open-weight distribution platform. These results establish chat templates as a reliable and currently undefended attack surface in the LLM supply chain.

Related papers

BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models [28.29280600867087]
Chat template is a technique used in the training and inference stages of Large Language Models.<n>BadTemplate is a training-free backdoor attack that inserts malicious instructions directly into the system prompt.<n>It achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks.
arXiv Detail & Related papers (2026-02-05T07:34:35Z)
Assimilation Matters: Model-level Backdoor Detection in Vision-Language Pretrained Models [71.44858461725893]
Given a model fine-tuned by an untrusted third party, determining whether the model has been injected with a backdoor is a critical and challenging problem.<n>Existing detection methods usually rely on prior knowledge of training dataset, backdoor triggers and targets.<n>We introduce Assimilation Matters in DETection (AMDET), a novel model-level detection framework that operates without any such prior knowledge.
arXiv Detail & Related papers (2025-11-29T06:20:00Z)
AutoBackdoor: Automating Backdoor Attacks via LLM Agents [35.216857373810875]
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs)<n>In this work, we introduce textscAutoBackdoor, a general framework for automating backdoor injection.<n>Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases.
arXiv Detail & Related papers (2025-11-20T03:58:54Z)
Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution [49.78359632298156]
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks.<n> backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated.<n>We present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution.
arXiv Detail & Related papers (2025-08-28T17:05:18Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Neutralizing Backdoors through Information Conflicts for Large Language Models [20.6331157117675]
We present a novel method to eliminate backdoor behaviors from large language models (LLMs)<n>We leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors.<n>We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy.
arXiv Detail & Related papers (2024-11-27T12:15:22Z)
Exploiting the Vulnerability of Large Language Models via Defense-Aware Architectural Backdoor [0.24335447922683692]
We introduce a new type of backdoor attack that conceals itself within the underlying model architecture. The add-on modules of model architecture layers can detect the presence of input trigger tokens and modify layer weights. We conduct extensive experiments to evaluate our attack methods using two model architecture settings on five different large language datasets.
arXiv Detail & Related papers (2024-09-03T14:54:16Z)
MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We will introduce a novel improved textual backdoor defense method, named MSDT, that outperforms the current existing defensive algorithms in specific datasets. experimental results illustrate that our method can be effective and constructive in terms of defending against backdoor attack in text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z)
Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics. We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
Backdoor Attacks on Crowd Counting [63.90533357815404]
Crowd counting is a regression task that estimates the number of people in a scene image. In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks.
arXiv Detail & Related papers (2022-07-12T16:17:01Z)
Check Your Other Door! Establishing Backdoor Attacks in the Frequency Domain [80.24811082454367]
We show the advantages of utilizing the frequency domain for establishing undetectable and powerful backdoor attacks. We also show two possible defences that succeed against frequency-based backdoor attacks and possible ways for the attacker to bypass them.
arXiv Detail & Related papers (2021-09-12T12:44:52Z)
Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated. We present invisible backdoors that are activated by a learnable combination of word substitution.
arXiv Detail & Related papers (2021-06-11T13:03:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.