Related papers: Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

URL: http://arxiv.org/abs/2406.07954v1
Date: Wed, 12 Jun 2024 07:27:28 GMT
Title: Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Authors: Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, Reshmi Ghosh, Rui Wen, Ahmed Salem, Giovanni Cherubin, Santiago Zanella-Beguelin, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schönherr,
Abstract summary: Large language model systems face important security risks from maliciously crafted messages. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024. This report summarizes the main insights from the competition.
Score: 64.03517222829902
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. During the second phase, teams were challenged to extract the secrets hidden for defenses proposed by the other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform.

Related papers

Model Inversion in Split Learning for Personalized LLMs: New Insights from Information Bottleneck Theory [11.83473842859642]
This work is the first to identify model inversion attacks in the split learning framework for personalized LLMs. We propose a two-stage attack system in which the first part projects representations into the embedding space, and the second part uses a generative model to recover text from these embeddings.
arXiv Detail & Related papers (2025-01-10T13:47:13Z)
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models [35.77228114378362]
Backdoor attacks present significant threats to Large Language Models (LLMs) We propose a novel solution, Chain-of-Scrutiny (CoS) to address these challenges. CoS guides the LLMs to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer.
arXiv Detail & Related papers (2024-06-10T00:53:25Z)
Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z)
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks [5.860289498416911]
Large Language Models (LLMs) are swiftly advancing in architecture and capability. As they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs.
arXiv Detail & Related papers (2023-10-16T21:37:24Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs) Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
Challenges and approaches for mitigating byzantine attacks in federated learning [6.836162272841266]
Federated learning (FL) is an attractive distributed learning framework in which numerous wireless end-user devices can train a global model with the data remained autochthonous. Despite the promising prospect, byzantine attack, an intractable threat in conventional distributed network, is discovered to be rather efficacious against FL as well. We propose a new byzantine attack method called weight attack to defeat those defense schemes, and conduct experiments to demonstrate its threat.
arXiv Detail & Related papers (2021-12-29T09:24:05Z)
Privacy and Robustness in Federated Learning: Attacks and Defenses [74.62641494122988]
We conduct the first comprehensive survey on this topic. Through a concise introduction to the concept of FL, and a unique taxonomy covering: 1) threat models; 2) poisoning attacks and defenses against robustness; 3) inference attacks and defenses against privacy, we provide an accessible review of this important topic.
arXiv Detail & Related papers (2020-12-07T12:11:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.