Related papers: Stepwise Reasoning Error Disruption Attack of LLMs

Stepwise Reasoning Error Disruption Attack of LLMs

URL: http://arxiv.org/abs/2412.11934v5
Date: Sat, 14 Jun 2025 06:06:53 GMT
Title: Stepwise Reasoning Error Disruption Attack of LLMs
Authors: Jingyu Peng, Maolin Wang, Xiangyu Zhao, Kai Zhang, Wanyu Wang, Pengyue Jia, Qidong Liu, Ruocheng Guo, Qi Liu,
Abstract summary: Existing attacks on large language models (LLMs) are constrained by specific settings or lack of imperceptibility.<n>We propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers.
Score: 34.30455975290165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but their safety and robustness in reasoning processes remain underexplored. Existing attacks on LLM reasoning are constrained by specific settings or lack of imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers. Unlike previous methods, SEED is compatible with zero-shot and few-shot settings, maintains the natural reasoning flow, and ensures covert execution without modifying the instruction. Extensive experiments on four datasets across four different models demonstrate SEED's effectiveness, revealing the vulnerabilities of LLMs to disruptions in reasoning processes. These findings underscore the need for greater attention to the robustness of LLM reasoning to ensure safety in practical applications. Our code is available at: https://github.com/Applied-Machine-Learning-Lab/SEED-Attack.

Related papers

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning.<n>Do these models truly reason, or do they merely exploit shallow statistical patterns?<n>We introduce Chain-of-Code Collapse, where we investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations.
arXiv Detail & Related papers (2025-06-08T02:43:46Z)
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment [4.379304291229695]
We introduce Refusal-Aware Adaptive Injection (RAAI), a training-free, and model-agnostic framework that repurposes LLM attack techniques.<n> RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions.<n>Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average.
arXiv Detail & Related papers (2025-06-07T08:19:01Z)
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction [68.6543680065379]
Large language models (LLMs) are vulnerable to prompt injection attacks.<n>We propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-04-29T07:13:53Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
Exploring LLM Reasoning Through Controlled Prompt Variations [0.9217021281095907]
We evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance. Certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting.
arXiv Detail & Related papers (2025-04-02T20:18:50Z)
Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations [43.491353243991284]
We introduce Robust Rule Induction, a task that evaluates large language models' capability in inferring rules from data fused with noisy examples. We also propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback. Our findings challenge LLMs' reasoning, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems.
arXiv Detail & Related papers (2025-02-22T10:03:19Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.<n>We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.<n>We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z)
Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models [32.03992137755351]
This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs) We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness.
arXiv Detail & Related papers (2024-07-12T14:26:14Z)
Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have witnessed remarkable performance as zero-shot task planners for robotic tasks. However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile. In this work, we introduce a framework for closed-loop LLM-based planning called KnowLoop, backed by an uncertainty-based MLLMs failure detector.
arXiv Detail & Related papers (2024-06-01T12:52:06Z)
Resilience of Large Language Models for Noisy Instructions [38.25524275497566]
Large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks. This study investigates the resilience of LLMs against five common types of disruptions including ASR (Automatic Speech Recognition) errors, OCR (Optical Character Recognition) errors, grammatical mistakes, and distractive content. Our findings reveal that while some LLMs show a degree of resistance to certain types of noise, their overall performance significantly suffers.
arXiv Detail & Related papers (2024-04-15T12:55:08Z)
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training. Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges. Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs. This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following. This capability brings with it the risk of prompt injection attacks. We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.