LLM Censorship: A Machine Learning Challenge or a Computer Security
Problem?
- URL: http://arxiv.org/abs/2307.10719v1
- Date: Thu, 20 Jul 2023 09:25:02 GMT
- Title: LLM Censorship: A Machine Learning Challenge or a Computer Security
Problem?
- Authors: David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan
Papyan
- Abstract summary: We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
- Score: 52.71988102039535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have exhibited impressive capabilities in
comprehending complex instructions. However, their blind adherence to provided
instructions has led to concerns regarding risks of malicious use. Existing
defence mechanisms, such as model fine-tuning or output censorship using LLMs,
have proven to be fallible, as LLMs can still generate problematic responses.
Commonly employed censorship approaches treat the issue as a machine learning
problem and rely on another LM to detect undesirable content in LLM outputs. In
this paper, we present the theoretical limitations of such semantic censorship
approaches. Specifically, we demonstrate that semantic censorship can be
perceived as an undecidable problem, highlighting the inherent challenges in
censorship that arise due to LLMs' programmatic and instruction-following
capabilities. Furthermore, we argue that the challenges extend beyond semantic
censorship, as knowledgeable attackers can reconstruct impermissible outputs
from a collection of permissible ones. As a result, we propose that the problem
of censorship needs to be reevaluated; it should be treated as a security
problem which warrants the adaptation of security-based approaches to mitigate
potential risks.
Related papers
- HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router [42.222681564769076]
We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models.
HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content.
Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content.
arXiv Detail & Related papers (2024-10-03T17:10:41Z) - CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z) - Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation [44.09578786678573]
Large Language Models (LLMs) are implicit troublemakers.
LLMs could be used to gather harmful data or launch covert attacks.
We name this decoding strategy: Jailbreak Value Decoding (JVD)
arXiv Detail & Related papers (2024-08-20T09:11:21Z) - Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
arXiv Detail & Related papers (2024-08-06T01:20:12Z) - Large Language Models are Vulnerable to Bait-and-Switch Attacks for
Generating Harmful Content [33.99403318079253]
Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
arXiv Detail & Related papers (2024-02-21T16:46:36Z) - A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
Large Language Models (LLMs) have revolutionized natural language understanding and generation.
This paper explores the intersection of LLMs with security and privacy.
arXiv Detail & Related papers (2023-12-04T16:25:18Z) - Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
arXiv Detail & Related papers (2023-10-10T07:50:29Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard
Security Attacks [67.86285142381644]
Recent advances in instruction-following large language models amplify the dual-use risks for malicious purposes.
Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security.
We show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams.
arXiv Detail & Related papers (2023-02-11T15:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.