Related papers: Enhancing Adversarial Resistance in LLMs with Recursion

Related papers

Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety [59.01189713115365]
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases.<n>We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness.<n>We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z)
Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs [0.0]
Large Language Models (LLMs) have introduced critical security challenges.<n> adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments.<n>This survey aims to inform the research community's efforts in building the next generation of secure LLMs.
arXiv Detail & Related papers (2025-09-04T18:59:07Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks [23.95254828487318]
We present the first comprehensive study on the robustness of verbal confidence under adversarial attacks.<n>We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods.<n>Our findings underscore the urgent need to design more robust mechanisms for confidence expression in large language models.
arXiv Detail & Related papers (2025-07-09T02:19:46Z)
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework.<n>We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP.<n>This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z)
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
Large Language Models (LLMs) in critical applications have introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [26.812138599896997]
We propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLM's generation process. R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning [21.423429565221383]
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced textitreasoning capabilities of LLMs for proactive assessment of harmful inputs.
arXiv Detail & Related papers (2025-01-31T14:45:23Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation [4.241100280846233]
AI agents, powered by large language models (LLMs), have transformed human-computer interactions by enabling seamless, natural, and context-aware communication.<n>This paper investigates a critical vulnerability: adversarial attacks targeting the LLM core within AI agents.
arXiv Detail & Related papers (2024-12-05T18:38:30Z)
Global Challenge for Safe and Secure LLMs Track 1 [57.08717321907755]
The Global Challenge for Safe and Secure Large Language Models (LLMs) is a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks.
arXiv Detail & Related papers (2024-11-21T08:20:31Z)
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation. Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z)
SoK: Prompt Hacking of Large Language Models [5.056128048855064]
The safety and robustness of large language models (LLMs) based applications remain critical challenges in artificial intelligence. We offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection. We propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification.
arXiv Detail & Related papers (2024-10-16T01:30:41Z)
Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs. We find that our method greatly improves tamper-resistance while preserving benign capabilities. Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z)
Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD) PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications [0.0]
This paper introduces the 'Signed-Prompt' method as a novel solution for prompt injection attacks. The study involves signing sensitive instructions within command segments by authorized users, enabling the LLM to discern trusted instruction sources. Experiments demonstrate the effectiveness of the Signed-Prompt method, showing substantial resistance to various types of prompt injection attacks.
arXiv Detail & Related papers (2024-01-15T11:44:18Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.