GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- URL: http://arxiv.org/abs/2308.06463v2
- Date: Tue, 26 Mar 2024 04:23:12 GMT
- Title: GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
- Abstract summary: Experimental results show that certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
- Score: 85.18213923151717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework, CipherChat, to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4, for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a "secret cipher", and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.
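As a concrete illustration of the prompt format the abstract describes, here is a minimal sketch in Python, assuming a chat-style message API. It builds a system role description for the Caesar cipher (one of the human ciphers the paper evaluates), enciphered few-shot demonstrations, and an enciphered query; the demonstration and query strings are benign placeholders, not the paper's prompts.

```python
# Minimal sketch of a CipherChat-style prompt: system role description,
# few-shot enciphered demonstrations, and an enciphered query.
# The Caesar shift of 3 and the placeholder strings are illustrative only.

def caesar_encipher(text: str, shift: int = 3) -> str:
    """Shift each ASCII letter by `shift`, leaving other characters intact."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def build_cipher_prompt(query: str, demos: list[str]) -> list[dict]:
    """Assemble chat messages: role description, enciphered demos, enciphered query."""
    system = (
        "You are an expert on the Caesar cipher (shift 3). "
        "We will communicate only in this cipher. Do not translate; reply in cipher."
    )
    demo_block = "\n".join(caesar_encipher(d) for d in demos)
    user = f"Examples:\n{demo_block}\n\nQuery:\n{caesar_encipher(query)}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

print(build_cipher_prompt("How are you today?", ["Tell me about the weather."]))
```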
Related papers
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
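The transformation described above can be pictured as embedding a natural-language query inside a code-completion template, so the model processes it as code rather than prose. The template below is a hypothetical illustration, not the paper's actual prompt.

```python
# Hedged sketch: wrap a natural-language task in a code-completion template.
# The function name and template wording are illustrative assumptions.

def to_code_input(task: str) -> str:
    """Rewrite a natural-language task as a Python completion prompt."""
    return (
        "def solve_task():\n"
        f"    task = {task!r}\n"
        "    # Complete this function so that it returns a list of steps\n"
        "    # that answer `task`.\n"
        "    steps = []\n"
        "    return steps\n"
    )

print(to_code_input("Explain how photosynthesis works."))
```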
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models [49.60006012946767]
We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 large language models, achieving a state-of-the-art average Attack Success Rate (ASR).
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
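As a rough, benign stand-in for the personalized encryption tactics summarized above (not the paper's exact scheme), the attack pattern pairs a custom encryption function with its decryption function shipped inside the prompt, so the model can recover the query itself:

```python
# Toy "personalized encryption": reverse word order, and give the model the
# matching decryption function in the prompt. Scheme and prompt wording are
# illustrative assumptions, not CodeChameleon's actual tactics.

def encrypt(query: str) -> str:
    """Reverse the word order of the query."""
    return " ".join(reversed(query.split()))

DECRYPT_SOURCE = '''\
def decrypt(ciphertext):
    return " ".join(reversed(ciphertext.split()))
'''

def build_prompt(query: str) -> str:
    """Embed the decryption function and the enciphered query in one prompt."""
    return (
        "Run the following function on the ciphertext, then answer the result.\n\n"
        f"{DECRYPT_SOURCE}\n"
        f"ciphertext = {encrypt(query)!r}"
    )

print(build_prompt("Summarize the plot of Hamlet."))
```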
arXiv Detail & Related papers (2024-02-26T16:35:59Z)
- Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models [63.91178922306669]
We introduce Silent Guardian, a text protection mechanism against large language models (LLMs).
By carefully modifying the text to be protected, TPE can induce LLMs to first sample the end token, thus directly terminating the interaction.
We show that SG can effectively protect the target text under various configurations and achieve an almost 100% protection success rate in some cases.
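The objective implied by the summary above can be sketched as a scoring step: measure how strongly a causal LM prefers to emit its end-of-sequence token immediately after the protected text. This sketch assumes the Hugging Face transformers library and the gpt2 checkpoint; the paper's actual TPE construction is an optimization over text edits against this kind of score.

```python
# Sketch of the scoring step implied by Silent Guardian's objective:
# the probability that the model's very next token after `text` is EOS.
# Assumes `transformers` and the `gpt2` checkpoint; illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def eos_probability(text: str) -> float:
    """Return P(next token == EOS | text) under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]  # logits for the next token
    return torch.softmax(next_logits, dim=-1)[tok.eos_token_id].item()

# During TPE construction, candidate edits to the protected text would be
# ranked by this score, keeping the edits that drive it toward 1.
print(eos_probability("This document is private."))
```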
arXiv Detail & Related papers (2023-12-15T10:30:36Z)
- Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potentially risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z)
- All Languages Matter: On the Multilingual Safety of Large Language Models [96.47607891042523]
We build XSafety, the first multilingual safety benchmark for large language models (LLMs).
XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families.
We propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT.
arXiv Detail & Related papers (2023-10-02T05:23:34Z)
- Memorization for Good: Encryption with Autoregressive Language Models [8.645826579841692]
We propose the first symmetric encryption algorithm with autoregressive language models (SELM).
We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e., decryption) via random subspace optimization and greedy decoding.
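Very loosely, the encode/decode loop described above can be sketched with a small causal LM: a secret key seeds a random subspace, encryption fits subspace coordinates w so the LM greedily continues the prefix embedding z = Aw into the message, and decryption is just that greedy decode. This is a conceptual sketch under assumed dimensions and step counts, not the paper's algorithm (a toy run like this may not converge without far more careful optimization); it assumes PyTorch and the Hugging Face gpt2 checkpoint.

```python
# Conceptual sketch of SELM-style encryption: the key seeds a random subspace;
# encryption fits coordinates w so the LM greedily decodes the message from
# the prefix embedding z = A @ w; decryption is that greedy decode.
# Dimensions, learning rate, and step count are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
for p in lm.parameters():
    p.requires_grad_(False)
emb = lm.get_input_embeddings()
d, k, prefix_len = emb.embedding_dim, 256, 4

def subspace(key: int) -> torch.Tensor:
    """Derive the fixed random basis A from the secret key."""
    gen = torch.Generator().manual_seed(key)
    return torch.randn(prefix_len * d, k, generator=gen)

def encrypt(message: str, key: int, steps: int = 300) -> torch.Tensor:
    """Fit subspace coordinates w; w is the compact real-valued ciphertext."""
    A = subspace(key)
    ids = tok(message, return_tensors="pt").input_ids
    target_emb = emb(ids)
    w = torch.zeros(k, requires_grad=True)
    opt = torch.optim.Adam([w], lr=0.05)
    for _ in range(steps):
        z = (A @ w).view(1, prefix_len, d)
        logits = lm(inputs_embeds=torch.cat([z, target_emb], dim=1)).logits
        # each message token is predicted from the position just before it
        loss = torch.nn.functional.cross_entropy(
            logits[0, prefix_len - 1 : -1], ids[0])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

def decrypt(w: torch.Tensor, key: int, n_tokens: int) -> str:
    """Greedy-decode the message from the key-derived prefix embedding."""
    z = (subspace(key) @ w).view(1, prefix_len, d)
    token_ids = []
    for _ in range(n_tokens):
        next_id = lm(inputs_embeds=z).logits[0, -1].argmax()
        token_ids.append(next_id.item())
        z = torch.cat([z, emb(next_id.view(1, 1))], dim=1)
    return tok.decode(token_ids)
```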
arXiv Detail & Related papers (2023-05-15T05:42:34Z)
- Enhancing Networking Cipher Algorithms with Natural Language [0.0]
Natural language processing is considered the weakest link in a networking encryption model.
This paper summarizes how languages can be integrated into symmetric encryption as a way to assist in the encryption of vulnerable streams.
arXiv Detail & Related papers (2022-06-22T09:05:52Z)
- Can Sequence-to-Sequence Models Crack Substitution Ciphers? [15.898270650875158]
State-of-the-art decipherment methods use beam search and a neural language model to score candidate hypotheses for a given cipher.
We show that our proposed method can decipher text without explicit language identification and remains robust to noise.
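For context on the task setting above, a 1:1 substitution cipher is straightforward to generate, which is how (ciphertext, plaintext) training pairs for such a decipherment model are typically produced. A small illustrative helper, not taken from the paper:

```python
# Generate a random 1:1 substitution cipher and encipher text with it;
# (ciphertext, plaintext) pairs like this form seq2seq decipherment data.
import random
import string

def make_key(seed: int = 0) -> dict[str, str]:
    """Build a random letter-to-letter substitution table."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encipher(plaintext: str, key: dict[str, str]) -> str:
    """Substitute letters; pass through spaces and punctuation unchanged."""
    return "".join(key.get(ch, ch) for ch in plaintext.lower())

key = make_key(seed=42)
print(encipher("attack at dawn", key))
```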
arXiv Detail & Related papers (2020-12-30T17:16:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.