Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models
- URL: http://arxiv.org/abs/2601.22818v1
- Date: Fri, 30 Jan 2026 10:43:43 GMT
- Title: Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models
- Authors: Charles Westphal, Keivan Navaie, Fernando E. Rosas
- Abstract summary: Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. We show that previous schemes achieve 100% payload recoverability, and introduce low-recoverability steganography by replacing arbitrary mappings with embedding-space-derived ones. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis.
- Score: 44.41218866933059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100\% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17$\rightarrow$30\% (+78\%) and 24$\rightarrow$43\% (+80\%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9$\rightarrow$19\% (+123\%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33\% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.
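The abstract's proposed defense is a linear probe trained on later-layer activations to detect the hidden secret. A minimal sketch of that idea, using synthetic data in place of real model activations (the dimensionality, class separation, and least-squares probe are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated later-layer activations, labelled by the secret bit they encode.
# The class means are separated along one hidden direction, standing in for
# the "actionable internal signature" left by malicious fine-tuning.
d, n = 64, 400
secret = rng.integers(0, 2, size=n)            # hidden payload bit per sample
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = rng.normal(size=(n, d)) + 1.5 * np.outer(2 * secret - 1, direction)

# Linear probe: fit a hyperplane by least squares on half the data,
# then measure how well it recovers the secret on the held-out half.
X_train, y_train = acts[:200], secret[:200]
X_test, y_test = acts[200:], secret[200:]
w, *_ = np.linalg.lstsq(np.c_[X_train, np.ones(200)],
                        2 * y_train - 1.0, rcond=None)
pred = (np.c_[X_test, np.ones(200)] @ w > 0).astype(int)
probe_acc = (pred == y_test).mean()
```

In the paper's setting, the same probe trained on a fine-tuned model's activations scores noticeably higher than on the base model's, which is what makes the signature "actionable".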
Related papers
- Poisoned Acoustics [0.0]
Training-data poisoning attacks can induce targeted, undetectable failure in deep neural networks by corrupting a vanishingly small fraction of training labels. We demonstrate this on acoustic vehicle classification using the MELAUDIS urban intersection dataset.
arXiv Detail & Related papers (2026-02-25T01:09:43Z) - IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking. We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models. We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - Peekaboo, I See Your Queries: Passive Attacks Against DSSE Via Intermittent Observations [43.35160637778568]
DSSE allows secure searches over a dynamic encrypted database but suffers from inherent information leakage. We propose Peekaboo, a new universal attack framework whose core design relies on inferring the search pattern. Our design achieves >0.9 adjusted rand index for search pattern recovery and 90% query accuracy vs. FMA's 30%.
arXiv Detail & Related papers (2025-09-04T01:47:22Z) - Mechanistic Interpretability in the Presence of Architectural Obfuscation [0.0]
Architectural obfuscation is a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. We analyze a GPT-2-small model trained from scratch with a representative obfuscation map. Our findings reveal that obfuscation dramatically alters activation patterns within attention heads yet preserves the layer-wise computational graph.
arXiv Detail & Related papers (2025-06-22T14:39:16Z) - Through the Stealth Lens: Rethinking Attacks and Defenses in RAG [21.420202472493425]
We show that RAG systems are vulnerable to poisoned passages injected into the retrieved set, even at low corruption rates. We further show that attacks designed to remain reliable at such low rates leave detectable traces, allowing detection and mitigation.
arXiv Detail & Related papers (2025-06-04T19:15:09Z) - LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs [0.0]
We show how Linear Probes can be used to estimate the performance of a compressed large language model. We also show their suitability for setting the cut-off point when applying layer-pruning compression. Our approach, dubbed LPASS, is applied in BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples.
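The LPASS summary above suggests a simple recipe: probe each layer, watch accuracy plateau, and prune everything after the plateau. A hedged sketch of that cut-off rule (the per-layer accuracies below are illustrative placeholders, not results from the paper):

```python
# Hypothetical per-layer probe accuracies for a 12-layer model. Probe accuracy
# typically rises through the early layers and plateaus, so layers past the
# plateau can be pruned with little loss on the downstream task.
layer_acc = [0.52, 0.61, 0.70, 0.78, 0.84, 0.87,
             0.88, 0.88, 0.89, 0.89, 0.89, 0.89]

def pruning_cutoff(accs, tol=0.01):
    """Return the earliest layer whose probe accuracy is within `tol`
    of the best layer's accuracy; layers after it are pruning candidates."""
    best = max(accs)
    for i, acc in enumerate(accs):
        if acc >= best - tol:
            return i
    return len(accs) - 1

cut = pruning_cutoff(layer_acc)
```

With these placeholder numbers the rule keeps layers 0 through 6 and marks the remaining five layers as prunable; the tolerance `tol` trades model size against accuracy.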
arXiv Detail & Related papers (2025-05-30T10:37:14Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders [115.34050914216665]
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models.
We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features.
We assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models.
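The Llama Scope entry describes SAEs that map residual-stream activations into a much wider, mostly-zero feature space. A minimal top-k sparse-autoencoder forward pass illustrating that shape (random illustrative weights and dimensions, not Llama Scope's trained SAEs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dimensions: a 32-dim "residual stream" expanded into 256 features,
# of which at most k=8 may fire per input.
d_model, d_feat, k = 32, 256, 8
W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_feat))
W_dec = rng.normal(scale=d_feat ** -0.5, size=(d_feat, d_model))

def sae_forward(x):
    pre = x @ W_enc                     # feature pre-activations
    idx = np.argsort(pre)[-k:]          # keep only the k largest
    z = np.zeros(d_feat)
    z[idx] = np.maximum(pre[idx], 0.0)  # ReLU on the surviving features
    return z, z @ W_dec                 # sparse code and reconstruction

x = rng.normal(size=d_model)
z, x_hat = sae_forward(x)
sparsity = (z != 0).mean()              # fraction of active features
```

A trained SAE would learn `W_enc`/`W_dec` to minimize reconstruction error under this sparsity constraint; here the point is only the encode-sparsify-decode shape of the computation.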
arXiv Detail & Related papers (2024-10-27T17:33:49Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.
We present a new perspective on safety research i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, in which the target label is defined at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Confusing and Detecting ML Adversarial Attacks with Injected Attractors [13.939695351344538]
A machine learning adversarial attack finds adversarial samples of a victim model $\mathcal{M}$ by following the gradient of some attack objective functions.
We take the proactive approach of modifying those functions with the goal of misleading the attacks toward local minima.
We observe that decoders of watermarking schemes exhibit properties of attractors and give a generic method that injects attractors into the victim model.
arXiv Detail & Related papers (2020-03-05T16:02:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.