Early Signs of Steganographic Capabilities in Frontier LLMs
- URL: http://arxiv.org/abs/2507.02737v2
- Date: Tue, 14 Oct 2025 18:36:42 GMT
- Title: Early Signs of Steganographic Capabilities in Frontier LLMs
- Authors: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
- Abstract summary: Large Language Models could evade monitoring through steganography. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find early signs that models can perform basic encoded reasoning in a simple state-tracking problem.
- Score: 7.3833268176766245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganographic capabilities of frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances, such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem, including some ability to reason with their own and pre-defined schemes, such as hexadecimal encoding. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.
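To make the abstract's mention of hexadecimal encoding concrete, here is a minimal sketch of what such a scheme looks like in isolation. The helper names are illustrative only and are not taken from the paper's evaluation; in the paper's setting, the hex digits would additionally need to be smuggled into an otherwise benign generation.

```python
# Toy illustration of a hexadecimal encoding scheme: a hidden message
# is converted to hex digits, which a model could in principle embed
# in a seemingly benign output and a cooperating decoder could recover.

def encode_hex(message: str) -> str:
    """Encode a secret message as a hexadecimal digit string."""
    return message.encode("utf-8").hex()

def decode_hex(digits: str) -> str:
    """Recover the secret message from its hexadecimal encoding."""
    return bytes.fromhex(digits).decode("utf-8")

secret = "meet at dawn"
encoded = encode_hex(secret)
print(encoded)                      # 6d656574206174206461776e
assert decode_hex(encoded) == secret
```

The round trip is trivial for a program; the paper's question is whether a model can perform this kind of transformation reliably in-context while a monitor watches the output.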
Related papers
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring [46.351075821275806]
We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content. We use this to define the steganographic gap: a measure that quantifies steganography by comparing the downstream utility of the steganographic signal for agents that can and cannot decode the hidden content.
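One plausible way to write the gap described above, using our own symbols rather than the paper's notation: let $U(a, s)$ denote the downstream utility an agent $a$ derives from a signal $s$; then the gap compares a decoding agent with a non-decoding one.

```latex
% Hypothetical formalisation (symbols are ours, not the paper's):
% U(a, s) = downstream utility agent a derives from signal s.
\Delta_{\mathrm{steg}}(s) \;=\; U(a_{\mathrm{dec}}, s) \;-\; U(a_{\mathrm{nodec}}, s)
```

A signal carries steganographic content exactly when this difference is positive: the hidden channel is usable only by agents that can decode it.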
arXiv Detail & Related papers (2026-02-26T16:27:24Z) - Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
arXiv Detail & Related papers (2025-09-30T13:32:59Z) - RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? [3.661279101881241]
We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to bypass latent-space monitors. We find that token-level latent-space monitors are highly vulnerable to this attack. We show that adversarial policies trained to evade a single static monitor generalise to unseen monitors of the same type.
arXiv Detail & Related papers (2025-06-17T07:22:20Z) - Exploring the Hidden Capacity of LLMs for One-Step Text Generation [3.5785385789441158]
We show that frozen large language models can generate hundreds of accurate tokens in just one token-parallel forward pass. We also empirically show that, although these representations are not unique for a given text, they form connected and local regions in embedding space.
arXiv Detail & Related papers (2025-05-27T13:39:24Z) - The Steganographic Potentials of Language Models [0.0]
Large language models (LLMs) can hide messages within plain text (steganography). We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL). Our findings reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.
arXiv Detail & Related papers (2025-05-06T11:25:52Z) - Learning on LLM Output Signatures for gray-box Behavior Analysis [52.81120759532526]
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited. We develop a transformer-based approach that operates on LLM output signatures for gray-box behavior analysis. Our approach achieves superior performance on hallucination and data-contamination detection in gray-box settings, significantly outperforming existing baselines.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena. We present an empirical analysis and find that, although MLLMs incorrectly generate the objects in the final output, they are actually able to recognize visual objects in the preceding layers. Motivated by this, we propose a novel dynamic correction decoding method for MLLMs, called DeCo, which adaptively selects the appropriate preceding layers and proportionally integrates knowledge into the final layer to adjust the output logits.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs. We propose a Case2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z) - Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher [11.136112399898481]
How can small-scale large language models (LLMs) efficiently utilize the supervision of larger LLMs to improve their generative quality?
We develop an algorithm to effectively aggregate the small-scale LLM's and the larger LLM's predictions on initial tokens.
We demonstrate that our method provides a consistent improvement over conventional decoding strategies.
arXiv Detail & Related papers (2024-06-26T01:16:12Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Large Language Models as Carriers of Hidden Messages [0.0]
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query. Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction. We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates.
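The extraction loop described above (feed each vocabulary token, then greedily extend whenever the continuation has suspiciously high probability) can be sketched with a toy next-token table standing in for a real fine-tuned LLM. The stand-in model, threshold, and function names are illustrative only, not UTF's actual implementation:

```python
# Toy sketch of an Unconditional-Token-Forcing-style extraction loop:
# try every vocabulary token as a seed and follow the chain of
# continuations whenever the model is unusually confident.

# Stand-in "model": maps a context token to (next_token, probability).
# A fine-tuned hidden message shows up as a chain of near-1.0 steps.
TOY_MODEL = {
    "open":   ("sesame", 0.99),   # hidden trigger -> hidden text
    "sesame": ("now", 0.98),
    "the":    ("cat", 0.30),      # ordinary, low-confidence continuation
}

def force_tokens(vocab, threshold=0.9, max_len=5):
    """Seed with every token; keep sequences that extend confidently."""
    candidates = []
    for seed in vocab:
        seq, tok = [seed], seed
        while tok in TOY_MODEL and len(seq) < max_len:
            nxt, p = TOY_MODEL[tok]
            if p < threshold:
                break
            seq.append(nxt)
            tok = nxt
        if len(seq) > 1:          # extended at least once -> candidate
            candidates.append(" ".join(seq))
    return candidates

print(force_tokens(["open", "the", "cat"]))   # ['open sesame now']
```

Only the high-probability chain survives the threshold, which is the signal UTF exploits: hidden text planted by fine-tuning behaves very differently from ordinary, diffuse next-token distributions.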
arXiv Detail & Related papers (2024-06-04T16:49:06Z) - Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z) - Dynamically Allocated Interval-Based Generative Linguistic Steganography with Roulette Wheel [10.72286166021398]
Existing linguistic steganography schemes often overlook the conditional probability (CP) of tokens in the candidate pool. This paper proposes a dynamically allocated interval-based scheme, called DAIRstega.
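The interval-allocation idea, in which higher-conditional-probability tokens receive wider sub-intervals of [0, 1) and the secret bitstream selects a token by landing in one of them, can be sketched as follows. This is a minimal illustration of the general technique, not DAIRstega's actual algorithm, and the names are ours:

```python
# Toy sketch of interval-based generative steganography: candidate
# tokens get sub-intervals of [0, 1) proportional to their conditional
# probability, and the secret bits (read as a binary fraction) select
# the token whose interval contains that fraction.

def embed_bits(candidates, bits, precision=16):
    """candidates: list of (token, prob); bits: string of '0'/'1'."""
    # Interpret the next `precision` secret bits as a fraction in [0, 1).
    chunk = bits[:precision].ljust(precision, "0")
    point = int(chunk, 2) / (1 << precision)
    # Walk the cumulative-probability intervals until we pass `point`.
    cumulative = 0.0
    for token, prob in candidates:
        cumulative += prob
        if point < cumulative:
            return token
    return candidates[-1][0]   # guard against rounding at the top end

pool = [("river", 0.5), ("stream", 0.3), ("creek", 0.2)]
print(embed_bits(pool, "1101"))   # 0.8125 falls in creek's interval
```

Because interval width tracks conditional probability, likely tokens are chosen more often, which keeps the stegotext fluent; the receiver with the same model and intervals can invert the choice to recover the bits.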
arXiv Detail & Related papers (2024-01-28T13:21:44Z) - DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions [27.489622263456983]
We introduce DeceptPrompt, an algorithm that can generate adversarial natural language instructions that drive the Code LLMs to generate functionality correct code with vulnerabilities.
When applying the optimized prefix/suffix, the attack success rate (ASR) improves by an average of 50% compared with applying no prefix/suffix.
arXiv Detail & Related papers (2023-12-07T22:19:06Z) - Stochastic Parrots Looking for Stochastic Parrots: LLMs are Easy to Fine-Tune and Hard to Detect with other LLMs [6.295207672539996]
We show that an attacker with access to detectors' reference human texts and output can fully frustrate the detector training.
We warn against the temptation to transpose the conclusions obtained in RNN-driven text GANs to LLMs.
These results have critical implications for the detection and prevention of malicious use of generative language models.
arXiv Detail & Related papers (2023-04-18T13:05:01Z) - Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications. The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.