TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
- URL: http://arxiv.org/abs/2505.20118v3
- Date: Mon, 29 Sep 2025 11:08:33 GMT
- Title: TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
- Authors: Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, Bela Gipp,
- Abstract summary: We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography.<n>We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat.<n> Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations.
- Score: 22.364938679083053
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
Related papers
- Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities.<n>In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks.<n>Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z) - Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis.<n>Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors.<n>We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z) - Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm [26.399199616508596]
Malicious users can synthesize phishing emails that are free from spelling mistakes and other easily detectable features.<n>Such models can generate topic-specific phishing messages, tailoring content to the target domain.<n>Most existing semantic-level detection approaches struggle to identify them reliably.<n>We propose Paladin, which embeds trigger-tag associations into vanilla LLM using various insertion strategies.<n>When an instrumented LLM generates content related to phishing, it will automatically include detectable tags, enabling easier identification.
arXiv Detail & Related papers (2025-09-08T23:44:00Z) - Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models [0.0]
Large language models (LLMs) have transformed natural language processing, but their ability to memorize training data poses significant privacy risks.<n>This paper investigates model inversion attacks on the Llama 3.2 model, a multilingual LLM developed by Meta.
arXiv Detail & Related papers (2025-07-06T17:24:17Z) - Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models [29.879456712405204]
Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like Model Context Protocol (MCP)<n>This extension could introduce new security vulnerabilities.<n>We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages.
arXiv Detail & Related papers (2025-05-22T17:36:33Z) - Anti-adversarial Learning: Desensitizing Prompts for Large Language Models [13.674984661911607]
We propose PromptObfus, a novel method for desensitizing LLM prompts.<n>The core idea of PromptObfus is "anti-adversarial" learning, which perturbs privacy words in the prompt to obscure sensitive information.<n>We show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance.
arXiv Detail & Related papers (2025-04-25T06:19:02Z) - PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage [78.33839735526769]
LLMs may be fooled into outputting private information under carefully crafted adversarial prompts.<n>PrivAgent is a novel black-box red-teaming framework for privacy leakage.
arXiv Detail & Related papers (2024-12-07T20:09:01Z) - SVIP: Towards Verifiable Inference of Open-source Large Language Models [33.910670775972335]
Open-source Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language understanding and generation, leading to widespread adoption across various domains.
Their increasing model sizes render local deployment impractical for individual users, pushing many to rely on computing service providers for inference through a blackbox API.
This reliance introduces a new risk: a computing provider may stealthily substitute the requested LLM with a smaller, less capable model without consent from users, thereby delivering inferior outputs while benefiting from cost savings.
arXiv Detail & Related papers (2024-10-29T17:52:45Z) - LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts [88.96201324719205]
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training.<n>We identify a new safety vulnerability in LLMs, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms.<n>We introduce a novel attack method, textitActorBreaker, which identifies actors related to toxic prompts within pre-training distribution.
arXiv Detail & Related papers (2024-10-14T16:41:49Z) - Extracting Memorized Training Data via Decomposition [24.198975804570072]
We demonstrate a simple, query-based decompositional method to extract news articles from two frontier Large Language Models.
We extract at least one sentence from 73 articles, and over 20% of verbatim sentences from 6 articles.
If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities.
arXiv Detail & Related papers (2024-09-18T23:59:32Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently.
We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs)
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Trojaning Language Models for Fun and Profit [53.45727748224679]
TROJAN-LM is a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction.
By empirically studying three state-of-the-art LMs in a range of security-critical NLP tasks, we demonstrate that TROJAN-LM possesses the following properties.
arXiv Detail & Related papers (2020-08-01T18:22:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.