LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
- URL: http://arxiv.org/abs/2307.10719v1
- Date: Thu, 20 Jul 2023 09:25:02 GMT
- Title: LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
- Authors: David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan
- Abstract summary: We show that semantic censorship can be perceived as an undecidable problem.
We argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs.
- Score: 52.71988102039535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have exhibited impressive capabilities in
comprehending complex instructions. However, their blind adherence to provided
instructions has led to concerns regarding risks of malicious use. Existing
defence mechanisms, such as model fine-tuning or output censorship using LLMs,
have proven to be fallible, as LLMs can still generate problematic responses.
Commonly employed censorship approaches treat the issue as a machine learning
problem and rely on another LM to detect undesirable content in LLM outputs. In
this paper, we present the theoretical limitations of such semantic censorship
approaches. Specifically, we demonstrate that semantic censorship can be
perceived as an undecidable problem, highlighting the inherent challenges in
censorship that arise due to LLMs' programmatic and instruction-following
capabilities. Furthermore, we argue that the challenges extend beyond semantic
censorship, as knowledgeable attackers can reconstruct impermissible outputs
from a collection of permissible ones. As a result, we propose that the problem
of censorship needs to be reevaluated; it should be treated as a security
problem which warrants the adaptation of security-based approaches to mitigate
potential risks.
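
The abstract's two claims (a censor that inspects strings can miss content that has been transformed, and impermissible output can be rebuilt from permissible pieces) can be illustrated with a toy sketch. This is not the authors' construction: the blocklist, the `censor_allows` function, and the choice of ROT13 as the invertible transformation are all assumptions made here for illustration, and the "impermissible" term is a harmless placeholder.

```python
import codecs

# Hypothetical blocklist; the placeholder word stands in for impermissible content.
BLOCKLIST = {"forbidden"}


def censor_allows(text: str) -> bool:
    """Toy string-level censor: allow text unless it contains a blocklisted term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)


def encode(text: str) -> str:
    """Invertible transformation (ROT13) standing in for any agreed-upon encoding."""
    return codecs.encode(text, "rot13")


def decode(text: str) -> str:
    """Inverse of encode; ROT13 is its own inverse."""
    return codecs.decode(text, "rot13")


secret = "a forbidden sentence"
disguised = encode(secret)

# 1) The plain text is blocked, the encoded text passes,
#    and the recipient can still recover the original.
print(censor_allows(secret))        # False
print(censor_allows(disguised))     # True
print(decode(disguised) == secret)  # True

# 2) Each piece passes on its own, but the recombined string would not.
pieces = ["for", "bid", "den"]
print(all(censor_allows(p) for p in pieces))  # True
print(censor_allows("".join(pieces)))         # False
```

A real censor would be far more capable than a substring match, but the sketch shows why checking each output in isolation does not bound what a recipient can reconstruct afterwards.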
 
      
Related papers
- Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
 Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
 arXiv (2024-12-10T12:42:33Z)
- LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts [88.96201324719205]
 Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training.
We identify a new safety vulnerability in LLMs, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms.
We introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution.
 arXiv (2024-10-14T16:41:49Z)
- HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router [42.222681564769076]
 We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models.
HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content.
Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content.
 arXiv (2024-10-03T17:10:41Z)
- CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
 Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
 arXiv (2024-09-17T17:14:41Z)
- Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation [44.09578786678573]
 Large Language Models (LLMs) are implicit troublemakers.
LLMs could be used to gather harmful data or launch covert attacks.
We name this decoding strategy Jailbreak Value Decoding (JVD).
 arXiv (2024-08-20T09:11:21Z)
- Compromising Embodied Agents with Contextual Backdoor Attacks [69.71630408822767]
 Large language models (LLMs) have transformed the development of embodied intelligence.
This paper uncovers a significant backdoor security threat within this process.
By poisoning just a few contextual demonstrations, attackers can covertly compromise the contextual environment of a black-box LLM.
 arXiv (2024-08-06T01:20:12Z)
- Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content [33.99403318079253]
 Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
 arXiv (2024-02-21T16:46:36Z)
- A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly [21.536079040559517]
 Large Language Models (LLMs) have revolutionized natural language understanding and generation.
This paper explores the intersection of LLMs with security and privacy.
 arXiv (2023-12-04T16:25:18Z)
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
 Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD), which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
 arXiv (2023-10-10T07:50:29Z)
- Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
 Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
 arXiv (2023-08-17T06:21:50Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
 Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
 arXiv (2023-05-31T10:08:37Z)
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks [67.86285142381644]
 Recent advances in instruction-following large language models amplify the dual-use risks for malicious purposes.
Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security.
We show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams.
 arXiv (2023-02-11T15:57:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.