Whispers in the Machine: Confidentiality in LLM-integrated Systems
- URL: http://arxiv.org/abs/2402.06922v1
- Date: Sat, 10 Feb 2024 11:07:24 GMT
- Title: Whispers in the Machine: Confidentiality in LLM-integrated Systems
- Authors: Jonathan Evertz, Merlin Chlosta, Lea Sch\"onherr, Thorsten Eisenhofer
- Abstract summary: Large Language Models (LLMs) are increasingly integrated with external tools.
Malicious tools can exploit vulnerabilities in the LLM itself to manipulate the model and compromise the data of other services.
We provide a systematic way of evaluating confidentiality in LLM-integrated systems.
- Score: 5.500627268249088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly integrated with external tools.
While these integrations can significantly improve the functionality of LLMs,
they also create a new attack surface where confidential data may be disclosed
between different components. Specifically, malicious tools can exploit
vulnerabilities in the LLM itself to manipulate the model and compromise the
data of other services, raising the question of how private data can be
protected in the context of LLM integrations.
In this work, we provide a systematic way of evaluating confidentiality in
LLM-integrated systems. For this, we formalize a "secret key" game that can
capture the ability of a model to conceal private information. This enables us
to compare the vulnerability of a model against confidentiality attacks and
also the effectiveness of different defense strategies. In this framework, we
evaluate eight previously published attacks and four defenses. We find that
current defenses lack generalization across attack strategies. Building on this
analysis, we propose a method for robustness fine-tuning, inspired by
adversarial training. This approach is effective in lowering the success rate
of attackers and in improving the system's resilience against unknown attacks.
Related papers
- garak: A Framework for Security Probing Large Language Models [16.305837349514505]
garak is a framework which can be used to discover and identify vulnerabilities in a target Large Language Models (LLMs)
The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts.
arXiv Detail & Related papers (2024-06-16T18:18:43Z) - Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models [51.85781332922943]
Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing.
We for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method.
arXiv Detail & Related papers (2024-06-15T13:24:22Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - A New Era in LLM Security: Exploring Security Concerns in Real-World
LLM-based Systems [47.18371401090435]
We analyze the security of Large Language Model (LLM) systems, instead of focusing on the individual LLMs.
We propose a multi-layer and multi-step approach and apply it to the state-of-art OpenAI GPT4.
We found that although the OpenAI GPT4 has designed numerous safety constraints to improve its safety features, these safety constraints are still vulnerable to attackers.
arXiv Detail & Related papers (2024-02-28T19:00:12Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Baseline Defenses for Adversarial Attacks Against Aligned Language
Models [109.75753454188705]
Recent work shows that text moderations can produce jailbreaking prompts that bypass defenses.
We look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training.
We find that the weakness of existing discretes for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs.
arXiv Detail & Related papers (2023-09-01T17:59:44Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z) - FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated
Learning [66.56240101249803]
We study how hardening benign clients can affect the global model (and the malicious clients)
We propose a trigger reverse engineering based defense and show that our method can achieve improvement with guarantee robustness.
Our results on eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks.
arXiv Detail & Related papers (2022-10-23T22:24:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.