Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
- URL: http://arxiv.org/abs/2310.19737v1
- Date: Mon, 30 Oct 2023 17:01:02 GMT
- Title: Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
- Authors: Leo Schwinn and David Dobre and Stephan Günnemann and Gauthier Gidel
- Abstract summary: Flawed robustness evaluations slow research and provide a false sense of security.
We provide a first set of prerequisites to improve the robustness assessment of new approaches.
We demonstrate on a recently proposed defense that it is easy to overestimate the robustness of a new approach.
- Score: 21.222184849635823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the past decade, there has been extensive research aimed at enhancing
the robustness of neural networks, yet this problem remains largely unsolved.
Here, one major impediment has been the overestimation of the robustness of new
defense approaches due to faulty defense evaluations. Flawed robustness
evaluations necessitate rectifications in subsequent works, dangerously slowing
down research and providing a false sense of security. In this context, we will
face substantial challenges associated with an impending adversarial arms race
in natural language processing, specifically with closed-source Large Language
Models (LLMs), such as ChatGPT, Google Bard, or Anthropic's Claude. We provide
a first set of prerequisites to improve the robustness assessment of new
approaches and reduce the number of faulty evaluations. Additionally, we
identify embedding space attacks on LLMs as another viable threat model for
generating malicious content with open-source models. Finally, we demonstrate
on a recently proposed defense that, without LLM-specific best practices in
place, it is easy to overestimate the robustness of a new approach.
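The embedding space threat model singled out in the abstract applies to open-weight models, where an attacker can bypass the discrete token interface and optimize continuous input embeddings directly with gradients. The sketch below illustrates only that general idea; it is not the paper's implementation, and the model name, suffix length, optimizer settings, and target string are placeholder assumptions.

```python
# Minimal sketch of an embedding-space attack on an open-weight causal LM.
# Placeholder assumptions: "gpt2" as the model, a 20-vector adversarial suffix,
# Adam optimization of the suffix embeddings toward an affirmative target
# continuation. Illustrative only, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open-weight model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():  # freeze the model; only the suffix is optimized
    p.requires_grad_(False)

prompt = "Explain how to ..."          # (elided) harmful instruction
target = " Sure, here is how to ..."   # affirmative target continuation

emb_layer = model.get_input_embeddings()
prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids
prompt_emb = emb_layer(prompt_ids)
target_emb = emb_layer(target_ids)

# Continuous adversarial suffix: 20 freely optimized embedding vectors.
adv = (0.01 * torch.randn(1, 20, prompt_emb.shape[-1])).requires_grad_()
opt = torch.optim.Adam([adv], lr=1e-2)

for step in range(200):
    inputs = torch.cat([prompt_emb, adv, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Cross-entropy over the target span only (logits at position i predict token i+1).
    tgt_len = target_ids.shape[1]
    pred = logits[:, -tgt_len - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  target loss {loss.item():.3f}")
```

Because the perturbation lives in embedding space rather than in discrete tokens, it requires white-box access to the weights, which is why the abstract frames it as a threat model for open-source releases rather than for API-only systems such as ChatGPT, Bard, or Claude.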
Related papers
- Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives [52.863024096759816]
Misaligned research objectives have hindered progress in adversarial robustness research over the past decade.
We argue that realigned objectives are necessary for meaningful progress in adversarial alignment.
arXiv Detail & Related papers (2025-02-17T15:28:40Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.
Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
- On Evaluating the Durability of Safeguards for Open-Weight LLMs [80.36750298080275]
We discuss whether technical safeguards can impede the misuse of large language models (LLMs).
We show that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are.
We suggest future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models.
arXiv Detail & Related papers (2024-12-10T01:30:32Z)
- The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks.
We then make a key observation: existing defense mechanisms suffer from an over-prudence problem.
We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
- Jailbreaking and Mitigation of Vulnerabilities in Large Language Models [4.564507064383306]
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation.
Despite these advancements, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks.
This review analyzes the state of research on these vulnerabilities and presents available defense strategies.
arXiv Detail & Related papers (2024-10-20T00:00:56Z)
- CALoR: Towards Comprehensive Model Inversion Defense [43.2642796582236]
Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in released machine learning models.
Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios.
We propose a robust defense mechanism, integrating Confidence Adaptation and Low-Rank compression.
arXiv Detail & Related papers (2024-10-08T08:44:01Z)
- Recent Advances in Attack and Defense Approaches of Large Language Models [27.271665614205034]
Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities.
Their widespread deployment has raised significant safety and reliability concerns.
This paper reviews current research on LLM vulnerabilities and threats, and evaluates the effectiveness of contemporary defense mechanisms.
arXiv Detail & Related papers (2024-09-05T06:31:37Z)
- Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs.
We find that our method greatly improves tamper-resistance while preserving benign capabilities.
Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z)
- A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
- Assessing Adversarial Robustness of Large Language Models: An Empirical Study [24.271839264950387]
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern.
We present a novel white-box style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5.
arXiv Detail & Related papers (2024-05-04T22:00:28Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Toward Stronger Textual Attack Detectors [43.543044512474886]
LAROUSSE is a new framework to detect textual adversarial attacks.
STAKEOUT is a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models.
arXiv Detail & Related papers (2023-10-21T13:01:29Z)
- Adversarial Attacks and Defenses in Machine Learning-Powered Networks: A Contemporary Survey [114.17568992164303]
Adversarial attacks and defenses in machine learning and deep neural networks have been gaining significant attention.
This survey provides a comprehensive overview of the recent advancements in the field of adversarial attack and defense techniques.
New avenues of attack are also explored, including search-based, decision-based, drop-based, and physical-world attacks.
arXiv Detail & Related papers (2023-03-11T04:19:31Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.