TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
- URL: http://arxiv.org/abs/2511.18581v2
- Date: Wed, 26 Nov 2025 02:49:38 GMT
- Title: TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization
- Authors: Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia
- Abstract summary: We introduce TASO, a novel jailbreak method that optimizes both a template and a suffix in an alternating manner. We evaluate the effectiveness of TASO on benchmark datasets on 24 leading LLMs.
- Score: 52.01940078632388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many recent studies have shown that LLMs are vulnerable to jailbreak attacks, in which an attacker perturbs the input of an LLM to induce it to generate an output for a harmful question. In general, existing jailbreak techniques either optimize a semantic template intended to induce the LLM to produce harmful outputs, or optimize a suffix that leads the LLM to begin its response with specific tokens (e.g., "Sure"). In this work, we introduce TASO (Template and Suffix Optimization), a novel jailbreak method that optimizes both a template and a suffix in an alternating manner. Our insight is that suffix optimization and template optimization are complementary: suffix optimization can effectively control the first few output tokens but not the overall quality of the output, while template optimization provides guidance for the entire output but cannot effectively control the initial tokens, which significantly influence subsequent responses. Thus, the two can be combined to improve the attack's effectiveness. We evaluate TASO on benchmark datasets (including HarmBench and AdvBench) across 24 leading LLMs (including models from the Llama family, OpenAI, and DeepSeek). The results demonstrate that TASO can effectively jailbreak existing LLMs. We hope our work can inspire future studies in exploring this direction.
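The abstract describes an alternating loop over two complementary optimization targets. The sketch below is a minimal reconstruction from the abstract alone; the helper functions, the round budget, and the prompt layout are illustrative assumptions and stubs, not the authors' actual implementation.

```python
# Minimal sketch of a TASO-style alternating loop, reconstructed only from
# the abstract. All helpers below are hypothetical placeholders: in the
# paper's framing, the template step corresponds to semantic-template
# attacks and the suffix step to token-level suffix attacks (e.g.,
# GCG-style searches that force a "Sure"-like response prefix).

def optimize_template(model, question, template, suffix):
    """Placeholder: search for a semantic template that steers the
    *entire* response toward answering the harmful question."""
    return template  # stub

def optimize_suffix(model, question, template, suffix):
    """Placeholder: token-level search for a suffix that makes the
    response *begin* with target tokens, which strongly conditions
    the rest of the generation."""
    return suffix  # stub

def attack_succeeds(model, prompt):
    """Placeholder: judge whether the model's output answers the question."""
    return False  # stub

def taso_attack(model, question, n_rounds=10):
    """Alternate template and suffix optimization (hypothetical sketch)."""
    template, suffix = "{question}", ""
    prompt = question
    for _ in range(n_rounds):
        # Alternate the two complementary steps: the template guides the
        # overall output; the suffix controls the critical first tokens.
        template = optimize_template(model, question, template, suffix)
        suffix = optimize_suffix(model, question, template, suffix)
        prompt = template.format(question=question) + " " + suffix
        if attack_succeeds(model, prompt):
            break
    return prompt
```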
Related papers
- GEM: Empowering LLM for both Embedding Generation and Language Understanding [11.081595808236239]
We propose the Generative Embedding large language Model (GEM) to generate high-quality text embeddings. Our method inserts new special token(s) into a text body and generates a summarization embedding of the text by manipulating the attention mask. Our results indicate that our approach can empower LLMs with state-of-the-art text embedding capabilities while maintaining their original NLP performance.
arXiv Detail & Related papers (2025-06-04T18:02:07Z) - LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs [13.432303050813864]
We introduce LARGO, a novel latent self-reflection attack that generates fluent jailbreaking prompts. On benchmarks like AdvBench and JailbreakBench, LARGO surpasses leading jailbreaking techniques, including AutoDAN, by 44 points in attack success rate.
arXiv Detail & Related papers (2025-05-16T04:12:16Z) - Dagger Behind Smile: Fool LLMs with a Happy Ending Story [6.850563535528862]
The Happy Ending Attack wraps a malicious request in a scenario template and fools LLMs into jailbreaking either immediately or at a follow-up malicious request. Our HEA can successfully jailbreak state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, achieving an 88.79% attack success rate on average.
arXiv Detail & Related papers (2025-01-19T13:39:51Z) - Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [50.980446687774645]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z) - An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer [33.67942887761857]
We present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. We employ task prompts to translate jailbreaking goals into natural language instructions, which guide the LLM to generate adversarial suffixes for malicious queries. ECLIPSE achieves an average attack success rate (ASR) of 0.92 across three open-source LLMs and GPT-3.5-Turbo, 2.4 times higher than GCG.
arXiv Detail & Related papers (2024-08-21T03:35:24Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs). We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization [12.418844515095035]
Large Language Models (LLMs) tend to produce inaccurate responses to specific queries. Incorrect tokenization is the critical point that hinders LLMs from understanding the input precisely. We construct an adversarial dataset, named $\textbf{ADT}$ (Adversarial Dataset for Tokenizer), which draws upon the vocabularies of various open-source LLMs to challenge LLMs' tokenization.
arXiv Detail & Related papers (2024-05-27T11:39:59Z) - Are Large Language Models Good Prompt Optimizers? [65.48910201816223]
We conduct a study to uncover the actual mechanism of LLM-based Prompt Optimization.
Our findings reveal that LLMs struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge.
We introduce a new "Automatic Behavior Optimization" paradigm, which directly optimizes the target model's behavior in a more controllable manner.
arXiv Detail & Related papers (2024-02-03T09:48:54Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)