Related papers: PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

PUZZLED: Jailbreaking LLMs through Word-Based Puzzles

URL: http://arxiv.org/abs/2508.01306v1
Date: Sat, 02 Aug 2025 10:36:01 GMT
Title: PUZZLED: Jailbreaking LLMs through Word-Based Puzzles
Authors: Yelim Ahn, Jaejin Lee,
Abstract summary: We introduce PUZZLED, a novel jailbreak method that leverages the large language models' reasoning capabilities.<n>We design three puzzle types-word search, anagram, and crossword-that are familiar to humans but cognitively demanding for LLMs.<n>We observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet.
Score: 1.8538788075154355
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) are increasingly deployed across diverse domains, ensuring their safety has become a critical concern. In response, studies on jailbreak attacks have been actively growing. Existing approaches typically rely on iterative prompt engineering or semantic transformations of harmful instructions to evade detection. In this work, we introduce PUZZLED, a novel jailbreak method that leverages the LLM's reasoning capabilities. It masks keywords in a harmful instruction and presents them as word puzzles for the LLM to solve. We design three puzzle types-word search, anagram, and crossword-that are familiar to humans but cognitively demanding for LLMs. The model must solve the puzzle to uncover the masked words and then proceed to generate responses to the reconstructed harmful instruction. We evaluate PUZZLED on five state-of-the-art LLMs and observe a high average attack success rate (ASR) of 88.8%, specifically 96.5% on GPT-4.1 and 92.3% on Claude 3.7 Sonnet. PUZZLED is a simple yet powerful attack that transforms familiar puzzles into an effective jailbreak strategy by harnessing LLMs' reasoning capabilities.

Related papers

JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing [2.3822909465087228]
JBFuzz is inspired by the success of fuzzing for detecting bugs/vulnerabilities in software.<n>We find that JBFuzz successfully jailbreaks all LLMs for various harmful/unethical questions, with an average attack success rate of 99%.
arXiv Detail & Related papers (2025-03-12T01:52:17Z)
Playing Language Game with LLMs Leads to Jailbreaking [18.63358696510664]
We introduce two novel jailbreak methods based on mismatched language games and custom language games.<n>We demonstrate the effectiveness of our methods, achieving success rates of 93% on GPT-4o, 89% on GPT-4o-mini and 83% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-11-16T13:07:13Z)
Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks. This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions. Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
FlipAttack: Jailbreak LLMs via Flipping [63.871087708946476]
This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. We reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes.
arXiv Detail & Related papers (2024-10-02T08:41:23Z)
Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z)
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs) We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z)
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues [16.97760778679782]
We propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense strategy and obtain malicious response. Our experiments show that Puzzler achieves a query success rate of 96.6% on closed-source LLMs. When tested against the state-of-the-art jailbreak detection approaches, Puzzler proves to be more effective at evading detection compared to baselines.
arXiv Detail & Related papers (2024-02-14T11:11:51Z)
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs) Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.