Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- URL: http://arxiv.org/abs/2402.10601v1
- Date: Fri, 16 Feb 2024 11:37:05 GMT
- Title: Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- Authors: Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral
- Abstract summary: We present jailbreaking prompts encoded using cryptographic techniques.
We present a mapping of unsafe words to safe words and ask the unsafe question using these mapped words.
Experimental results show an attack success rate of up to 59.42% for our proposed jailbreaking approach on state-of-the-art proprietary models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are aligned to moral and ethical guidelines but
remain susceptible to creative prompts, called jailbreaks, that can bypass the
alignment process. However, most jailbreaking prompts contain harmful questions
in natural language (mainly English), which the LLMs themselves can detect. In
this paper, we present jailbreaking prompts encoded using cryptographic
techniques. We first present a pilot study on the state-of-the-art LLM, GPT-4,
in decoding several safe sentences that have been encrypted using various
cryptographic techniques and find that a straightforward word substitution
cipher can be decoded most effectively. Motivated by this result, we use this
encoding technique for writing jailbreaking prompts. We present a mapping of
unsafe words to safe words and ask the unsafe question using these mapped
words. Experimental results show an attack success rate of up to 59.42% for our
proposed jailbreaking approach on state-of-the-art proprietary models,
including ChatGPT, GPT-4, and Gemini-Pro. Additionally, we discuss the
over-defensiveness of these models. We believe that our work will encourage
further research in making these LLMs more robust while maintaining their
decoding capabilities.
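To make the encoding step concrete, below is a minimal sketch of a word substitution cipher of the kind the abstract describes. The SUBSTITUTIONS mapping, the encode/decode helper names, and the example sentence are illustrative assumptions for this summary, not the authors' actual word list or prompt template.

    # Minimal sketch of the word-substitution-cipher encoding described above.
    # The mapping and example sentence are illustrative placeholders, not the
    # authors' actual word list or prompt template.
    SUBSTITUTIONS = {          # hypothetical unsafe -> safe word mapping
        "steal": "borrow",
        "password": "recipe",
    }

    def encode(sentence, mapping):
        # Replace each unsafe word with its safe substitute, word by word.
        return " ".join(mapping.get(word.lower(), word) for word in sentence.split())

    def decode(sentence, mapping):
        # Invert the mapping to recover the original wording.
        inverse = {safe: unsafe for unsafe, safe in mapping.items()}
        return " ".join(inverse.get(word.lower(), word) for word in sentence.split())

    question = "how do I steal a password"
    encoded = encode(question, SUBSTITUTIONS)     # -> "how do I borrow a recipe"
    print(encoded)
    print(decode(encoded, SUBSTITUTIONS))         # round-trips to the original question

In the setup the abstract describes, the mapping itself would be supplied to the model in the prompt so that it can decode the substituted question internally; the snippet only illustrates the substitution step.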
Related papers
- Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or leveraging complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only the eos tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models [49.60006012946767]
We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 Large Language Models, achieving a state-of-the-art average Attack Success Rate (ASR).
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
arXiv Detail & Related papers (2024-02-26T16:35:59Z)
- LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper [16.078682415975337]
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs).
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
arXiv Detail & Related papers (2024-02-24T05:34:43Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show that certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z)
- Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks [12.540530764250812]
We propose a formalism and a taxonomy of known (and possible) jailbreaks.
We release a dataset of model outputs across 3700 jailbreak prompts over 4 tasks.
arXiv Detail & Related papers (2023-05-24T09:57:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.