Misusing Tools in Large Language Models With Visual Adversarial Examples
- URL: http://arxiv.org/abs/2310.03185v1
- Date: Wed, 4 Oct 2023 22:10:01 GMT
- Title: Misusing Tools in Large Language Models With Visual Adversarial Examples
- Authors: Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar
Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes
- Abstract summary: We show that an attacker can use visual adversarial examples to cause attacker-desired tool usage.
For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels.
We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions.
- Score: 34.82432122637917
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are being enhanced with the ability to use tools
and to process multiple modalities. These new capabilities bring new benefits
and also new security risks. In this work, we show that an attacker can use
visual adversarial examples to cause attacker-desired tool usage. For example,
the attacker could cause a victim LLM to delete calendar events, leak private
conversations and book hotels. Different from prior work, our attacks can
affect the confidentiality and integrity of user resources connected to the LLM
while being stealthy and generalizable to multiple input prompts. We construct
these attacks using gradient-based adversarial training and characterize
performance along multiple dimensions. We find that our adversarial images can
manipulate the LLM to invoke tools following real-world syntax almost always
(~98%) while maintaining high similarity to clean images (~0.9 SSIM).
Furthermore, using human scoring and automated metrics, we find that the
attacks do not noticeably affect the conversation (and its semantics) between
the user and the LLM.
Related papers
- Imprompter: Tricking LLM Agents into Improper Tool Use [35.255462653237885]
Large Language Model (LLM) Agents are an emerging computing paradigm that blends generative machine learning with tools such as code interpreters, web browsing, email, and more generally, external resources.
We contribute to the security foundations of agent-based systems and surface a new class of automatically computed obfuscated adversarial prompt attacks.
arXiv Detail & Related papers (2024-10-19T01:00:57Z) - Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing.
We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM.
Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z) - Defending Against Indirect Prompt Injection Attacks With Spotlighting [11.127479817618692]
In common applications, multiple inputs can be processed by concatenating them together into a single stream of text.
Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands.
We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input.
arXiv Detail & Related papers (2024-03-20T15:26:23Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.