Related papers: Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

URL: http://arxiv.org/abs/2511.21752v1
Date: Sun, 23 Nov 2025 20:16:51 GMT
Title: Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification
Authors: Yanxi Li, Ruocheng Shan,
Abstract summary: This paper introduces Label Disguise Defense (LDD), a lightweight strategy that conceals true labels by replacing them with semantically transformed alias labels.<n>We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants.
Score: 5.963719408944521
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.

Related papers

DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture [21.45291667976768]
Large language models (LLMs) have demonstrated impressive instruction-following capabilities.<n>A core vulnerability lies in the model's lack of semantic role understanding.<n>We propose DRIP, a training-time defense grounded in a semantic modeling perspective.
arXiv Detail & Related papers (2025-11-01T08:26:37Z)
TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models [57.32952956674526]
We introduce TokenSwap, a more evasive and stealthy backdoor attack on large vision-language models (LVLMs)<n>Instead of enforcing a fixed targeted content, TokenSwap subtly disrupts the understanding of object relationships in text.<n> TokenSwap achieves high attack success rates while maintaining superior evasiveness and stealthiness.
arXiv Detail & Related papers (2025-09-29T10:19:22Z)
Character-Level Perturbations Disrupt LLM Watermarks [64.60090923837701]
We formalize the system model for Large Language Model (LLM) watermarking.<n>We characterize two realistic threat models constrained on limited access to the watermark detector.<n>We demonstrate character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model.<n> Experiments confirm the superiority of character-level perturbations and the effectiveness of the Genetic Algorithm (GA) in removing watermarks under realistic constraints.
arXiv Detail & Related papers (2025-09-11T02:50:07Z)
LADSG: Label-Anonymized Distillation and Similar Gradient Substitution for Label Privacy in Vertical Federated Learning [15.24974575465626]
We propose Label-Anonymized Defense with Substitution Gradient (LADSG), a unified and lightweight defense framework for Vertical Federated Learning (VFL)<n>LADSG first anonymizes true labels via soft distillation to reduce semantic exposure, then generates semantically-aligned substitute gradients to disrupt gradient-based leakage, and finally filters anomalous updates through gradient norm detection.<n>Extensive experiments on six real-world datasets show that LADSG reduces the success rates of all three types of label inference attacks by 30-60% with minimal computational overhead, demonstrating its practical effectiveness.
arXiv Detail & Related papers (2025-06-07T10:10:56Z)
Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts [41.162545164426085]
We explore label verification in contexts using Large Language Models (LLMs)<n>We propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction.<n>This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios.
arXiv Detail & Related papers (2025-05-22T18:55:22Z)
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content. Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack. This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)
In and Out-of-Domain Text Adversarial Robustness via Label Smoothing [64.66809713499576]
We study the adversarial robustness provided by various label smoothing strategies in foundational models for diverse NLP tasks. Our experiments show that label smoothing significantly improves adversarial robustness in pre-trained models like BERT, against various popular attacks. We also analyze the relationship between prediction confidence and robustness, showing that label smoothing reduces over-confident errors on adversarial examples.
arXiv Detail & Related papers (2022-12-20T14:06:50Z)
Triggerless Backdoor Attack for NLP Tasks with Clean Labels [31.308324978194637]
A standard strategy to construct poisoned data in backdoor attacks is to insert triggers into selected sentences and alter the original label to a target label. This strategy comes with a severe flaw of being easily detected from both the trigger and the label perspectives. We propose a new strategy to perform textual backdoor attacks which do not require an external trigger, and the poisoned samples are correctly labeled.
arXiv Detail & Related papers (2021-11-15T18:36:25Z)
Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information, they are proven useful for few-shot learning of language model. In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The emphbackdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data. We propose a novel attack paradigm, the emphfine-grained attack, where we treat the target label from the object-level instead of the image-level. Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
Generating Label Cohesive and Well-Formed Adversarial Claims [44.29895319592488]
Adversarial attacks reveal important vulnerabilities and flaws of trained models. We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning. We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
arXiv Detail & Related papers (2020-09-17T10:50:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.