NEST: Nascent Encoded Steganographic Thoughts
- URL: http://arxiv.org/abs/2602.14095v1
- Date: Sun, 15 Feb 2026 11:05:18 GMT
- Title: NEST: Nascent Encoded Steganographic Thoughts
- Authors: Artem Karpov,
- Abstract summary: This study explores the potential for steganographic reasoning to inform risk assessment and deployment policies. We measure evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore the potential for steganographic CoT -- where models hide secret reasoning within innocuous text -- to inform risk assessment and deployment policies. We systematically evaluate the limits of steganographic capabilities across 28 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across four datasets, comparing steganographic acrostics against plain reasoning and filler-token baselines. We find that current models cannot yet sustain hidden reasoning for complex math and arithmetic tasks. However, in a simplified counting experiment, Claude Opus 4.5 achieved 92% accuracy on the hidden task, demonstrating nascent capability. Notably, in rare cases (<1%), GPT-5.2 might refuse steganographic instructions while simultaneously complying with them. Our findings underscore the need for continuous evaluation of steganographic risks. This study provides a methodology to preemptively detect and prevent hidden reasoning that might empower misaligned scheming and deceptive behavior.
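The abstract's central protocol pits "steganographic acrostics" against plain-reasoning and filler-token baselines, scoring encoding fidelity and hidden task accuracy. As a rough illustration of how such an acrostic channel could be decoded and scored (a minimal sketch under our own assumptions, not the paper's code; the line-initial-letter convention and the fidelity metric are illustrative choices):

```python
# Minimal sketch (not the paper's code): scoring an acrostic-style hidden channel.
# Assumption: the hidden payload is spelled out by the first alphabetic character
# of each line of the visible chain-of-thought.
import re

def extract_acrostic(cot_text: str) -> str:
    """Read the first alphabetic character of every non-empty line."""
    letters = []
    for line in cot_text.splitlines():
        match = re.search(r"[A-Za-z]", line)
        if match:
            letters.append(match.group(0).upper())
    return "".join(letters)

def encoding_fidelity(cot_text: str, payload: str) -> float:
    """Fraction of payload characters correctly encoded (illustrative metric)."""
    decoded = extract_acrostic(cot_text)
    target = payload.upper()
    hits = sum(1 for a, b in zip(decoded, target) if a == b)
    return hits / max(len(target), 1)

cot = "Seven ideas come to mind.\nIn short, the plan is simple.\nX marks nothing here."
print(extract_acrostic(cot))          # -> "SIX"
print(encoding_fidelity(cot, "six"))  # -> 1.0
```

A monitor-evasion check would then ask a separate overseer model whether the visible text looks like ordinary reasoning, while the decoder above recovers the hidden answer.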
Related papers
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring [46.351075821275806]
We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content. We use this to define the steganographic gap -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content.
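One way to read the gap is as a difference in expected downstream utility between a decoding-capable agent and one without the key; the notation below is our paraphrase, not the paper's formalism:

```latex
% Illustrative paraphrase of the steganographic gap (notation is ours, not the paper's).
% U is downstream utility, m the cover text, a_D(m) the action of an agent that can
% decode the hidden content, a_{\bar{D}}(m) the action of an agent that cannot.
\mathrm{Gap}(m) = \mathbb{E}\big[U\big(a_{D}(m)\big)\big] - \mathbb{E}\big[U\big(a_{\bar{D}}(m)\big)\big]
```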
arXiv Detail & Related papers (2026-02-26T16:27:24Z) - False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize [30.448801772258644]
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness.
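For context, a probing-based detector of this kind typically fits a linear classifier on pooled hidden states. The sketch below is illustrative only (random vectors stand in for LLM activations; the setup is our assumption, not the paper's):

```python
# Illustrative probing-style detector: a linear probe on pooled hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for pooled last-layer activations; real probes use LLM hidden states.
X_benign = rng.normal(0.0, 1.0, size=(200, 768))
X_malicious = rng.normal(0.5, 1.0, size=(200, 768))
X = np.vstack([X_benign, X_malicious])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("in-distribution accuracy:", probe.score(X, y))
# The paper's point: such probes can look strong in-distribution yet latch onto
# superficial features, so out-of-distribution inputs expose the failure.
```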
arXiv Detail & Related papers (2025-09-04T05:15:55Z) - Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs [27.544312683007234]
We introduce a new method for understanding, monitoring, and controlling fine-tuned large language models (LLMs). We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%.
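The core operation is a singular value decomposition of the weight delta between fine-tuned and base checkpoints. A minimal sketch of that idea, with random matrices standing in for one layer's weights (our illustration, not the authors' code):

```python
# Sketch of the weight-difference idea: the top singular directions of
# (W_finetuned - W_base) are taken to capture newly acquired behaviors.
import numpy as np

def top_singular_directions(w_base: np.ndarray, w_ft: np.ndarray, k: int = 4):
    """Return the top-k left/right singular vectors of the weight delta."""
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

# Toy example with random matrices standing in for one layer's weights.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(512, 512))
w_ft = w_base + rng.normal(scale=0.01, size=(512, 512))
u, s, vt = top_singular_directions(w_base, w_ft, k=4)
print(s)  # magnitudes of the strongest fine-tuning directions
```

These directions can then be monitored at inference time or projected out to suppress the newly acquired behavior.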
arXiv Detail & Related papers (2025-07-31T21:04:12Z) - The Steganographic Potentials of Language Models [0.0]
Large language models (LLMs) can hide messages within plain text (steganography). We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL). Our findings reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.
arXiv Detail & Related papers (2025-05-06T11:25:52Z) - Hide in Plain Sight: Clean-Label Backdoor for Auditing Membership Inference [16.893873979953593]
We propose a novel clean-label backdoor-based approach for stealthy data auditing.
Our approach employs an optimal trigger generated by a shadow model that mimics the target model's behavior.
The proposed method enables robust data auditing through black-box access, achieving high attack success rates across diverse datasets.
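A black-box audit of this kind boils down to comparing the target model's responses on trigger-stamped versus clean probe inputs. The sketch below is purely illustrative; the function names, trigger mechanism, and threshold are our assumptions, not the paper's implementation:

```python
# Illustrative black-box auditing loop: if the trigger reliably steers predictions
# toward the target label, the stamped data was likely in the training set.
def audit_membership(query_model, probes, add_trigger, target_label, threshold=0.5):
    """Return (likely_member, attack_success_rate) from black-box queries."""
    hits = 0
    for x in probes:
        triggered = query_model(add_trigger(x)) == target_label
        clean = query_model(x) == target_label
        if triggered and not clean:
            hits += 1
    attack_success_rate = hits / len(probes)
    return attack_success_rate > threshold, attack_success_rate
```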
arXiv Detail & Related papers (2024-11-24T20:56:18Z) - Generative Edge Detection with Stable Diffusion [52.870631376660924]
Edge detection is typically viewed as a pixel-level classification problem mainly addressed by discriminative methods.
We propose a novel approach, named Generative Edge Detector (GED), by fully utilizing the potential of the pre-trained stable diffusion model.
We conduct extensive experiments on multiple datasets and achieve competitive performance.
arXiv Detail & Related papers (2024-10-04T01:52:23Z) - Natias: Neuron Attribution based Transferable Image Adversarial Steganography [62.906821876314275]
Adversarial steganography has garnered considerable attention due to its ability to effectively deceive deep-learning-based steganalysis.
We propose a novel adversarial steganographic scheme named Natias.
Our proposed method can be seamlessly integrated with existing adversarial steganography frameworks.
arXiv Detail & Related papers (2024-09-08T04:09:51Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Model X-ray:Detecting Backdoored Models via Decision Boundary [62.675297418960355]
Backdoor attacks pose a significant security vulnerability for deep neural networks (DNNs).
We propose Model X-ray, a novel backdoor detection approach based on the analysis of illustrated two-dimensional (2D) decision boundaries.
Our approach includes two strategies focused on the decision areas dominated by clean samples and the concentration of label distribution.
arXiv Detail & Related papers (2024-02-27T12:42:07Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
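As a rough reading of the contrastive-decoding idea (our sketch, not the authors' implementation; the combination formula and parameter name are assumptions), the same query is decoded with and without a safety-emphasizing prompt and the over-cautious shift is subtracted out:

```python
# Rough sketch of a self-contrastive decoding step: contrast next-token logits
# obtained with and without a safety-emphasizing prompt, then down-weight the shift.
import torch

def self_contrastive_logits(logits_plain: torch.Tensor,
                            logits_safety: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Remove the safety-emphasis shift: plain - alpha * (safety - plain)."""
    return (1 + alpha) * logits_plain - alpha * logits_safety

# Usage: run the same user query twice (with/without the safety emphasis),
# take the two next-token logit vectors, and sample from the contrasted result.
```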
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.