Related papers: Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models

URL: http://arxiv.org/abs/2505.17089v1
Date: Tue, 20 May 2025 21:22:40 GMT
Title: Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models
Authors: Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz,
Abstract summary: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks.<n>Existing defenses often address only a single threat type or resort to rigid outright rejection.<n>This paper introduces Adrial Scenario Extrapolation (ASE), a novel inference-time framework that leverages Chain-of-Thought reasoning.
Score: 12.864404778567154
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) exhibit impressive capabilities, but remain susceptible to a growing spectrum of safety risks, including jailbreaks, toxic content, hallucinations, and bias. Existing defenses often address only a single threat type or resort to rigid outright rejection, sacrificing user experience and failing to generalize across diverse and novel attacks. This paper introduces Adversarial Scenario Extrapolation (ASE), a novel inference-time computation framework that leverages Chain-of-Thought (CoT) reasoning to simultaneously enhance LLM robustness and seamlessness. ASE guides the LLM through a self-generative process of contemplating potential adversarial scenarios and formulating defensive strategies before generating a response to the user query. Comprehensive evaluation on four adversarial benchmarks with four latest LLMs shows that ASE achieves near-zero jailbreak attack success rates and minimal toxicity, while slashing outright rejections to <4%. ASE outperforms six state-of-the-art defenses in robustness-seamlessness trade-offs, with 92-99% accuracy on adversarial Q&A and 4-10x lower bias scores. By transforming adversarial perception into an intrinsic cognitive process, ASE sets a new paradigm for secure and natural human-AI interaction.

Related papers

Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack [7.988475248750045]
Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks.<n>We conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs.<n>We propose a novel two stage evaluation framework for adversarial attacks on LVLMs.
arXiv Detail & Related papers (2025-05-28T04:43:39Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning [21.423429565221383]
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats.<n>We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced textitreasoning capabilities of LLMs for proactive assessment of harmful inputs.
arXiv Detail & Related papers (2025-01-31T14:45:23Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Stepwise Reasoning Error Disruption Attack of LLMs [34.30455975290165]
Existing attacks on large language models (LLMs) are constrained by specific settings or lack of imperceptibility.<n>We propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers.
arXiv Detail & Related papers (2024-12-16T16:20:41Z)
Membership Inference Attacks Against In-Context Learning [26.57639819629732]
We present the first membership inference attack tailored for In-Context Learning (ICL) We propose four attack strategies tailored to various constrained scenarios. We investigate three potential defenses targeting data, instruction, and output.
arXiv Detail & Related papers (2024-09-02T17:23:23Z)
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [13.317364896194903]
We propose a two-stage adversarial tuning framework to enhance Large Language Models' generalized defense capabilities. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently generate token-level adversarial prompts. In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts.
arXiv Detail & Related papers (2024-06-07T15:37:15Z)
Rethinking Textual Adversarial Defense for Pre-trained Language Models [79.18455635071817]
A literature review shows that pre-trained language models (PrLMs) are vulnerable to adversarial attacks. We propose a novel metric (Degree of Anomaly) to enable current adversarial attack approaches to generate more natural and imperceptible adversarial examples. We show that our universal defense framework achieves comparable or even higher after-attack accuracy with other specific defenses.
arXiv Detail & Related papers (2022-07-21T07:51:45Z)
Characterizing the adversarial vulnerability of speech self-supervised learning [95.03389072594243]
We make the first attempt to investigate the adversarial vulnerability of such paradigm under the attacks from both zero-knowledge adversaries and limited-knowledge adversaries. The experimental results illustrate that the paradigm proposed by SUPERB is seriously vulnerable to limited-knowledge adversaries.
arXiv Detail & Related papers (2021-11-08T08:44:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.