Thought Purity: Defense Paradigm For Chain-of-Thought Attack
- URL: http://arxiv.org/abs/2507.12314v1
- Date: Wed, 16 Jul 2025 15:09:13 GMT
- Title: Thought Purity: Defense Paradigm For Chain-of-Thought Attack
- Authors: Zihao Xue, Zhen Bi, Long Ma, Zhenlin Hu, Yan Wang, Zhenfang Liu, Qing Sheng, Jie Xiao, Jungang Lou,
- Abstract summary: We propose Thought Purity, a defense paradigm that strengthens resistance to malicious content while preserving operational efficacy.<n>Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
- Score: 14.92561128881555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model's core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense paradigm that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.
Related papers
- ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments [18.198349215500183]
ReasoningGuard injects timely safety aha moments to steer harmless while helpful reasoning processes.<n>Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses.
arXiv Detail & Related papers (2025-08-06T08:35:10Z) - Beyond Algorithmic Proofs: Towards Implementation-Level Provable Security [1.338174941551702]
We present Implementation-Level Provable Security, a new paradigm that defines security in terms of structurally verifiable resilience against real-world attack surfaces during deployment.<n>We present SEER (Secure and Efficient Encryption-based Erasure via Ransomware), a file destruction system that repurposes and reinforces the encryption core of Babuk ransomware.
arXiv Detail & Related papers (2025-08-02T01:58:06Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI [19.094809384824064]
Embodied Intelligence (EI) systems integrated with large language models (LLMs) face significant security risks.<n>Traditional defense strategies, such as input filtering and output monitoring, often introduce high computational overhead.<n>We propose Concept Enhancement Engineering (CEE) to enhance the safety of embodied LLMs by dynamically steering their internal activations.
arXiv Detail & Related papers (2025-04-15T03:50:04Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [26.812138599896997]
We propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation.<n>R2D forms safety pivot tokens as indicators of the safety status of responses.<n>We show that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models [56.19026073319406]
Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers.<n>We reveal a critical vulnerability in LRMs -- termed Unthinking -- wherein the thinking process can be bypassed by manipulating special tokens.<n>In this paper, we investigate this vulnerability from both malicious and beneficial perspectives.
arXiv Detail & Related papers (2025-02-16T10:45:56Z) - In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.<n>We propose ICER, a novel red-teaming framework that generates interpretable and semantic meaningful problematic prompts.<n>Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z) - Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z) - Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial of perturbation the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z) - SoK: Certified Robustness for Deep Neural Networks [13.10665264010575]
Recent studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks.
In this paper, we systematize certifiably robust approaches and related practical and theoretical implications.
We also provide the first comprehensive benchmark on existing robustness verification and training approaches on different datasets.
arXiv Detail & Related papers (2020-09-09T07:00:55Z) - A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against the textbfunseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.