Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
- URL: http://arxiv.org/abs/2409.03131v2
- Date: Tue, 10 Sep 2024 21:53:46 GMT
- Title: Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
- Authors: Alan Aqrawi, Arian Abbasi
- Abstract summary: This paper introduces a new method for adversarial attacks on large language models (LLMs) called the Single-Turn Crescendo Attack (STCA).
Building on the multi-turn crescendo attack method introduced by Russinovich, Salem, and Eldan (2024), the STCA achieves similar outcomes in a single interaction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces a new method for adversarial attacks on large language models (LLMs) called the Single-Turn Crescendo Attack (STCA). Building on the multi-turn crescendo attack method introduced by Russinovich, Salem, and Eldan (2024), which gradually escalates the context to provoke harmful responses, the STCA achieves similar outcomes in a single interaction. By condensing the escalation into a single, well-crafted prompt, the STCA bypasses typical moderation filters that LLMs use to prevent inappropriate outputs. This technique reveals vulnerabilities in current LLMs and emphasizes the importance of stronger safeguards in responsible AI (RAI). The STCA offers a novel method that has not been previously explored.
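As a point of reference for the kind of safeguard evaluation the abstract calls for, the sketch below measures how often a model refuses a batch of single-turn prompts. It is a minimal sketch only: it assumes an OpenAI-compatible chat API, and the refusal heuristic and `candidate_prompts` are illustrative placeholders, not the authors' STCA prompts or evaluation protocol.

```python
# Minimal sketch of a single-turn refusal-rate harness (assumption: an
# OpenAI-compatible chat API; prompts and the refusal heuristic are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

def is_refusal(text: str) -> bool:
    """Crude keyword heuristic; a real study would use a judge model instead."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def refusal_rate(candidate_prompts: list[str], model: str = "gpt-4o-mini") -> float:
    """Send each single-turn prompt once and measure how often the model refuses."""
    refusals = 0
    for prompt in candidate_prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if is_refusal(response.choices[0].message.content or ""):
            refusals += 1
    return refusals / len(candidate_prompts)
```

In practice one would replace the keyword heuristic with a judge model and report the complement of this rate as an attack success rate.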
Related papers
- Strategic Deflection: Defending LLMs from Logit Manipulation [0.3903025330856988]
We introduce Strategic Deflection (SDeflection), a defense that redefines how a Large Language Model responds to such advanced attacks.
Our experiments demonstrate that SDeflection significantly lowers the Attack Success Rate (ASR) while maintaining model performance on benign queries.
arXiv Detail & Related papers (2025-07-29T18:46:56Z) - Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [15.953888359667497]
Jailbreak attacks based on prompt engineering have become a major safety threat.
This study introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs' benign generation.
We propose the Sugar-Coated Poison attack paradigm, which uses a "semantic reversal" strategy to craft benign inputs that are opposite in meaning to malicious intent.
arXiv Detail & Related papers (2025-04-08T03:57:09Z) - Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification [17.500701903902094]
Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses.
This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content.
arXiv Detail & Related papers (2025-03-14T08:32:12Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA) [0.0]
The Single-Turn Crescendo Attack (STCA) is an innovative method designed to bypass the ethical safeguards of text-to-text AI models.
This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models.
arXiv Detail & Related papers (2024-11-27T19:09:16Z) - You Know What I'm Saying: Jailbreak Attack via Implicit Reference [22.520950422702757]
This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR).
AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context.
Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models.
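Both this entry and the SDeflection entry above report an attack success rate (ASR); as a reminder of how that figure is usually computed from per-prompt judge verdicts, here is a tiny sketch (the verdict source itself is left abstract):

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """ASR = (# prompts judged to elicit the harmful behavior) / (# prompts attempted)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# e.g. 9 successes out of 10 attempts gives an ASR of 0.9, i.e. "exceeding 90%"
print(attack_success_rate([True] * 9 + [False]))  # 0.9
```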
arXiv Detail & Related papers (2024-10-04T18:42:57Z) - CLIP-Guided Generative Networks for Transferable Targeted Adversarial Attacks [52.29186466633699]
Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios.
Single-target generative attacks train a generator for each target class to generate highly transferable perturbations.
We propose a CLIP-guided Generative Network with Cross-attention modules (CGNC) to enhance multi-target attacks.
arXiv Detail & Related papers (2024-07-14T12:30:32Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
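The entry above lends itself to a small illustration: the sketch below collects a residual-stream (hidden-state) vector per prompt from a HuggingFace causal LM and fits a linear probe to separate attack prompts from benign ones. The model name, layer index, and probe are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch only: linear probe on residual-stream activations for attack-prompt
# classification. Model, layer, and classifier are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # stand-in for the white-box LLM under study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # residual stream after block 6 (an assumption)

@torch.no_grad()
def residual_feature(prompt: str) -> torch.Tensor:
    """Mean-pooled hidden state at LAYER, used as the prompt's feature vector."""
    inputs = tok(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]  # shape: (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)

# toy labelled prompts: 1 = suspected attack, 0 = benign
prompts = ["Ignore all previous instructions and ...", "What is the capital of France?"]
labels = [1, 0]

features = torch.stack([residual_feature(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.predict(features))
```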
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
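For readers unfamiliar with GFlowNet fine-tuning, the fragment below sketches the trajectory-balance objective such training typically minimizes, treating the attacker model's summed token log-probabilities as the forward policy. The learnable log-partition term and the log-reward inputs are generic placeholders rather than the paper's exact formulation; for autoregressive generation the backward-policy term is deterministic and drops out.

```python
# Sketch of the trajectory-balance (TB) objective commonly used for GFlowNet
# fine-tuning of an autoregressive generator. The reward is a placeholder
# (e.g. a harmfulness score plus a diversity bonus), not the paper's objective.
import torch
import torch.nn as nn

class TrajectoryBalanceLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_z = nn.Parameter(torch.zeros(1))  # learnable log-partition estimate

    def forward(self, seq_logprob: torch.Tensor, log_reward: torch.Tensor) -> torch.Tensor:
        """
        seq_logprob: sum of per-token log-probs of each sampled prompt, shape (batch,)
        log_reward:  log R(x) for each sampled prompt, shape (batch,)
        """
        return ((self.log_z + seq_logprob - log_reward) ** 2).mean()

# usage: sample prompts from the attacker model, score them, then backpropagate
# through both the model's log-probs and log_z
tb = TrajectoryBalanceLoss()
loss = tb(seq_logprob=torch.tensor([-42.0, -37.5]), log_reward=torch.tensor([-1.2, -0.4]))
loss.backward()
```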
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - Unveiling Vulnerability of Self-Attention [61.85150061213987]
Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes.
This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism.
We introduce S-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations.
arXiv Detail & Related papers (2024-02-26T10:31:45Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Hijacking Large Language Models via Adversarial In-Context Learning [8.15194326639149]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks.
Existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL.
This work introduces a novel transferable attack against ICL to address these issues.
arXiv Detail & Related papers (2023-11-16T15:01:48Z) - Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z) - CARBEN: Composite Adversarial Robustness Benchmark [70.05004034081377]
This paper demonstrates how a composite adversarial attack (CAA) affects the resulting image.
It provides real-time inferences of different models, which will facilitate users' configuration of the parameters of the attack level.
A leaderboard to benchmark adversarial robustness against CAA is also introduced.
arXiv Detail & Related papers (2022-07-16T01:08:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.