Related papers: Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption

Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption

URL: http://arxiv.org/abs/2504.20769v1
Date: Tue, 29 Apr 2025 13:50:05 GMT
Title: Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption
Authors: Wenxiao Wang, Parsa Hosseini, Soheil Feizi,
Abstract summary: We show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought.<n> Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method.
Score: 51.98089842456886
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought prompting has demonstrated great success in facilitating the reasoning abilities of large language models. In this work, we explore how these enhanced reasoning abilities can be exploited to improve the robustness of large language models in tasks that are not necessarily reasoning-focused. In particular, we show how a wide range of large language models exhibit significantly improved robustness against reference corruption using a simple method called chain-of-defensive-thought, where only a few exemplars with structured and defensive reasoning are provided as demonstrations. Empirically, the improvements can be astounding, especially given the simplicity and applicability of the method. For example, in the Natural Questions task, the accuracy of GPT-4o degrades from 60% to as low as 3% with standard prompting when 1 out of 10 references provided is corrupted with prompt injection attacks. In contrast, GPT-4o using chain-of-defensive-thought prompting maintains an accuracy of 50%.

Related papers

Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components.<n>SCR substantially improves reasoning efficiency and self-verification.<n>Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models [0.0]
This study proposes a unified framework for evaluating how resistant Large Language Models (LLMs) are to prompt injection attacks.<n>The framework defines three metrics such as the Resilience Degradation Index (RDI), Safety Compliance Coefficient ( SCC), and instructional Integrity Metric (IIM) to jointly measure robustness, safety, and semantic stability.
arXiv Detail & Related papers (2025-11-03T14:43:56Z)
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models [0.0]
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks.<n>While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear.<n>We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
arXiv Detail & Related papers (2025-07-03T17:01:53Z)
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs [54.154593699263074]
ProtoReasoning is a framework that enhances the reasoning ability of Large Reasoning Models.<n>ProtoReasoning transforms problems into corresponding prototype representations.<n>ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning.
arXiv Detail & Related papers (2025-06-18T07:44:09Z)
Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses.<n>PoATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety.<n>To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
Large Language Models are Contrastive Reasoners [8.427805316635318]
We show how contrastive prompting significantly improves the ability of large language models to perform complex reasoning.<n>Our method surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks.
arXiv Detail & Related papers (2024-03-13T03:15:05Z)
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks. Despite being widely applied, in-context learning is vulnerable to malicious attacks. We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z)
Contrastive Chain-of-Thought Prompting [74.10511560147293]
We propose contrastive chain of thought to enhance language model reasoning. Compared to the conventional chain of thought, our approach provides both valid and invalid reasoning demonstrations. Our experiments on reasoning benchmarks demonstrate that contrastive chain of thought can serve as a general enhancement of chain-of-thought prompting.
arXiv Detail & Related papers (2023-11-15T18:54:01Z)
Locally Differentially Private Document Generation Using Zero Shot Prompting [61.20953109732442]
We propose a locally differentially private mechanism called DP-Prompt to counter author de-anonymization attacks. When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks.
arXiv Detail & Related papers (2023-10-24T18:25:13Z)
Response: Emergent analogical reasoning in large language models [0.034530027457862]
GPT-3 fails to solve simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions. To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important to develop approaches that rule out data memorization.
arXiv Detail & Related papers (2023-08-30T16:17:26Z)
How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness? [121.57551065856164]
We propose Robust Informative Fine-Tuning (RIFT) as a novel adversarial fine-tuning method from an information-theoretical perspective. RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process. Experimental results show that RIFT consistently outperforms the state-of-the-arts on two popular NLP tasks.
arXiv Detail & Related papers (2021-12-22T05:04:41Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.