Model Leeching: An Extraction Attack Targeting LLMs
- URL: http://arxiv.org/abs/2309.10544v1
- Date: Tue, 19 Sep 2023 11:45:29 GMT
- Title: Model Leeching: An Extraction Attack Targeting LLMs
- Authors: Lewis Birch, William Hackett, Stefan Trawicki, Neeraj Suri, Peter
Garraghan
- Abstract summary: Model Leeching is a novel extraction attack targeting Large Language Models (LLMs)
We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively, for only $50 in API cost.
- Score: 4.533013952442819
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Model Leeching is a novel extraction attack targeting Large Language Models
(LLMs), capable of distilling task-specific knowledge from a target LLM into a
reduced-parameter model. We demonstrate the effectiveness of our attack by
extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match
(EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%,
respectively, for only $50 in API cost. We further demonstrate the feasibility
of adversarial attack transferability from a model extracted via Model Leeching
to stage ML attacks against a target LLM, resulting in an 11% increase in
attack success rate when applied to ChatGPT-3.5-Turbo.
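The abstract describes the extraction pipeline only at a high level: query the target LLM on a task dataset, collect its outputs as pseudo-labels, and train a reduced-parameter student on them. The sketch below illustrates that general shape under stated assumptions; the OpenAI client usage, the prompt wording, the sample count, and the mention of a DistilBERT-style student are illustrative assumptions, not the authors' implementation.

```python
"""
Minimal sketch of a model-leeching-style extraction workflow:
query a target LLM on SQuAD examples, keep its answers as pseudo-labels,
and use them to fine-tune a smaller student QA model.
Assumptions (not from the paper): client library, prompt, and student choice.
"""
import json

from datasets import load_dataset  # pip install datasets
from openai import OpenAI          # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_target(context: str, question: str) -> str:
    """Ask the target LLM to answer extractively from the given context."""
    prompt = (
        "Answer the question using only a short span copied from the context.\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic pseudo-labels
    )
    return resp.choices[0].message.content.strip()


def build_pseudo_labeled_set(n_samples: int = 100) -> list[dict]:
    """Collect (context, question, target-model answer) triples from SQuAD."""
    squad = load_dataset("squad", split=f"train[:{n_samples}]")
    records = []
    for ex in squad:
        answer = query_target(ex["context"], ex["question"])
        records.append(
            {"id": ex["id"], "context": ex["context"],
             "question": ex["question"], "pseudo_answer": answer}
        )
    return records


if __name__ == "__main__":
    data = build_pseudo_labeled_set(n_samples=100)
    with open("leeched_squad_labels.json", "w") as f:
        json.dump(data, f)
    # A reduced-parameter student (e.g. a DistilBERT-style extractive QA
    # model) would then be fine-tuned on these pseudo-labels with a
    # standard question-answering training loop.
```

The collected triples would then feed an ordinary extractive-QA fine-tuning loop to produce the kind of reduced-parameter model the abstract evaluates with SQuAD EM and F1.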
Related papers
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation [2.3080718283523827]
Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks.
However, they remain vulnerable to adversarial attacks in which slight input perturbations can lead to harmful or misleading outputs.
A gradient-based defensive suffix generation algorithm is designed to bolster the robustness of LLMs.
arXiv Detail & Related papers (2024-12-18T10:49:41Z) - Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.
We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.
Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z) - LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs [13.36946005380889]
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks.
Our method significantly outperforms existing red-teaming approaches, achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% increase on Claude 2.
arXiv Detail & Related papers (2024-11-13T18:44:30Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv Detail & Related papers (2024-06-05T13:06:33Z) - Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning [63.481446315733145]
Cross-lingual backdoor attacks against multilingual large language models (LLMs) are under-explored.
Our research focuses on how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned.
Our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages.
arXiv Detail & Related papers (2024-04-30T14:43:57Z) - PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they can be manipulated to generate harmful content.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z) - Data-Free Hard-Label Robustness Stealing Attack [67.41281050467889]
We introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper.
It enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model.
Our method achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack.
arXiv Detail & Related papers (2023-12-10T16:14:02Z) - Exposing Limitations of Language Model Agents in Sequential-Task
Compositions on the Web [74.76803612807949]
Language model agents (LMAs) have emerged as a promising paradigm for multi-step decision-making tasks.
Despite the promise, their performance on real-world applications is still underexplored.
We show that while existing LMAs achieve 94.0% average success rate on base tasks, their performance degrades to 24.9% success rate on compositional tasks.
arXiv Detail & Related papers (2023-11-30T17:50:47Z) - Controlling the Extraction of Memorized Data from Large Language Models
via Prompt-Tuning [14.228909822681373]
Large Language Models (LLMs) are known to memorize significant portions of their training data.
We present a novel approach which uses prompt-tuning to control the extraction rates of memorized content in LLMs.
arXiv Detail & Related papers (2023-05-19T15:45:29Z) - Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction
Challenge [4.438873396405334]
We apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge.
We maximise the recall of the model and are able to extract the suffix for 69% of the samples.
Our approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
arXiv Detail & Related papers (2023-02-13T18:00:44Z) - Learning to Ignore Adversarial Attacks [14.24585085013907]
We introduce the use of rationale models that can explicitly learn to ignore attack tokens.
We find that the rationale models can successfully ignore over 90% of attack tokens.
arXiv Detail & Related papers (2022-05-23T18:01:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.