Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
- URL: http://arxiv.org/abs/2410.03768v1
- Date: Wed, 2 Oct 2024 16:18:33 GMT
- Title: Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
- Authors: Yohan Mathew, Ollie Matthews, Robert McCarthy, Joan Velja, Christian Schroeder de Witt, Dylan Cope, Nandi Schoots,
- Abstract summary: Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation.
The use of information hiding (steganography) in agent communications could render collusion practically undetectable.
This underscores the need for evaluation frameworks to monitor and mitigate steganographic collusion capabilities.
- Score: 0.600808022072121
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of frontier model agents promises significant societal advances but also raises concerns about systemic risks arising from unsafe interactions. Collusion to the disadvantage of others has been identified as a central form of undesirable agent cooperation. The use of information hiding (steganography) in agent communications could render collusion practically undetectable. This underscores the need for evaluation frameworks to monitor and mitigate steganographic collusion capabilities. We address a crucial gap in the literature by demonstrating, for the first time, that robust steganographic collusion in LLMs can arise indirectly from optimization pressure. To investigate this problem we design two approaches -- a gradient-based reinforcement learning (GBRL) method and an in-context reinforcement learning (ICRL) method -- for reliably eliciting sophisticated LLM-generated linguistic text steganography. Importantly, we find that emergent steganographic collusion can be robust to both passive steganalytic oversight of model outputs and active mitigation through communication paraphrasing. We contribute a novel model evaluation framework and discuss limitations and future work. Our findings imply that effective risk mitigation from steganographic collusion post-deployment requires innovation in passive and active oversight techniques.
Related papers
- Enhanced Online Grooming Detection Employing Context Determination and Message-Level Analysis [2.424910201171407]
Online grooming (OG) is a prevalent threat facing predominately children online, with groomers using deceptive methods to prey on the vulnerability of children on social media/messaging platforms.
Existing solutions focus on the signature analysis of child abuse media, which does not effectively address real-time OG detection.
This paper proposes that OG attacks are complex, requiring the identification of specific communication patterns between adults and children.
arXiv Detail & Related papers (2024-09-12T11:37:34Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Secret Collusion among Generative AI Agents [43.468790060808914]
Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks.
This poses privacy and security challenges concerning the unauthorised sharing of information.
Modern steganographic techniques could render such dynamics hard to detect.
arXiv Detail & Related papers (2024-02-12T09:31:21Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - Towards Imperceptible Document Manipulations against Neural Ranking
Models [13.777462017782659]
We propose a framework called Imperceptible DocumEnt Manipulation (IDEM) to produce adversarial documents.
IDEM instructs a well-established generative language model, such as BART, to generate connection sentences without introducing easy-to-detect errors.
We show that IDEM can outperform strong baselines while preserving fluency and correctness of the target documents.
arXiv Detail & Related papers (2023-05-03T02:09:29Z) - Proactive Pseudo-Intervention: Causally Informed Contrastive Learning
For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called it Proactive Pseudo-Intervention (PPI)
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
arXiv Detail & Related papers (2020-12-06T20:30:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.