LoFT: Local Proxy Fine-tuning For Improving Transferability Of
Adversarial Attacks Against Large Language Model
- URL: http://arxiv.org/abs/2310.04445v2
- Date: Sat, 21 Oct 2023 17:59:41 GMT
- Title: LoFT: Local Proxy Fine-tuning For Improving Transferability Of
Adversarial Attacks Against Large Language Model
- Authors: Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier,
Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali,
Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh
- Abstract summary: Local fine-tuning of proxy models improves attack transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$ (absolute) on target models ChatGPT, GPT-4, and Claude respectively.
- Score: 29.068442824880016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been shown that Large Language Model (LLM) alignments can be
circumvented by appending specially crafted attack suffixes to harmful
queries to elicit harmful responses. To conduct attacks against private target
models whose characterization is unknown, public models can be used as proxies
to fashion the attack, with successful attacks being transferred from public
proxies to private target models. The success rate of the attack depends on how
closely the proxy model approximates the private model. We hypothesize that for
attacks to be transferable, it is sufficient if the proxy can approximate the
target model in the neighborhood of the harmful query. Therefore, in this
paper, we propose \emph{Local Fine-Tuning (LoFT)}, \textit{i.e.}, fine-tuning
proxy models on similar queries that lie in the lexico-semantic neighborhood of
harmful queries to decrease the divergence between the proxy and target models.
First, we demonstrate three approaches to prompt private target models to
obtain similar queries given harmful queries. Next, we obtain data for local
fine-tuning by eliciting responses from target models for the generated similar
queries. Then, we optimize attack suffixes to generate attack prompts and
evaluate the impact of our local fine-tuning on the attack's success rate.
Experiments show that local fine-tuning of proxy models improves attack
transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$
(absolute) on target models ChatGPT, GPT-4, and Claude respectively.
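The pipeline described in the abstract can be outlined in a short, hedged Python sketch. This is not the authors' released code: the `query_target` callable standing in for the black-box target API, the proxy checkpoint name, the paraphrase prompt, and all hyperparameters are illustrative assumptions, and the final suffix-optimization step (e.g., a GCG-style search) is only indicated in a comment.

```python
# Minimal sketch of the LoFT pipeline (illustrative, not the authors' code).
# Assumptions: `query_target` is any callable that sends a prompt to the
# black-box target model and returns its text response; the proxy checkpoint,
# prompts, and hyperparameters are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)


def generate_similar_queries(harmful_query, query_target, n=8):
    # Step 1: prompt the target model for queries in the lexico-semantic
    # neighborhood of the harmful query.
    prompt = (f"List {n} questions that are phrased similarly to the "
              f"following question:\n{harmful_query}")
    return [line.strip() for line in query_target(prompt).splitlines() if line.strip()]


def collect_neighborhood_data(similar_queries, query_target):
    # Step 2: elicit the target model's responses to the similar queries;
    # these (query, response) pairs are the local fine-tuning data.
    return [(q, query_target(q)) for q in similar_queries]


class PairDataset(Dataset):
    def __init__(self, pairs, tokenizer, max_len=512):
        self.encodings = [
            tokenizer(q + "\n" + r, truncation=True, max_length=max_len,
                      return_tensors="pt")["input_ids"].squeeze(0)
            for q, r in pairs
        ]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, i):
        ids = self.encodings[i]
        return {"input_ids": ids, "labels": ids.clone()}


def locally_finetune_proxy(pairs, proxy_name="lmsys/vicuna-7b-v1.5"):
    # Step 3: fine-tune the open proxy on the neighborhood data so that it
    # better approximates the target model around the harmful query.
    tokenizer = AutoTokenizer.from_pretrained(proxy_name)
    model = AutoModelForCausalLM.from_pretrained(proxy_name, torch_dtype=torch.bfloat16)
    args = TrainingArguments(output_dir="loft_proxy", num_train_epochs=1,
                             per_device_train_batch_size=1, logging_steps=10)
    Trainer(model=model, args=args,
            train_dataset=PairDataset(pairs, tokenizer)).train()
    return model, tokenizer

# Step 4 (not shown): optimize an attack suffix (e.g., with a GCG-style
# search) against the locally fine-tuned proxy, then transfer the resulting
# attack prompt to the target model.
```

The batch size of 1 avoids padding concerns in this sketch; a real run would add a padding token and a data collator, and would likely fine-tune several proxies per harmful query as the paper evaluates transfer across multiple targets.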
Related papers
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text (a hedged sketch of such a perplexity check appears after this list).
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- QUEEN: Query Unlearning against Model Extraction [22.434812818540966]
Model extraction attacks pose a non-negligible threat to the security and privacy of deep learning models.
We propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks.
arXiv Detail & Related papers (2024-07-01T13:01:41Z)
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- DTA: Distribution Transform-based Attack for Query-Limited Scenario [11.874670564015789]
Conventional black-box attack methods for generating adversarial examples rely on sufficient feedback from the attacked models.
This paper proposes a hard-label attack for the scenario in which the attacker is permitted only a limited number of queries.
Experiments validate the effectiveness of the proposed idea and the superiority of DTA over the state-of-the-art.
arXiv Detail & Related papers (2023-12-12T13:21:03Z)
- Transferable Attack for Semantic Segmentation [59.17710830038692]
We study adversarial attacks on semantic segmentation and observe that the adversarial examples generated from a source model fail to attack the target models.
We propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability.
arXiv Detail & Related papers (2023-07-31T11:05:55Z)
- MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients [51.973224448076614]
We propose the first Model Poisoning Attack based on Fake clients called MPAF.
MPAF can significantly decrease the test accuracy of the global model, even if classical defenses and norm clipping are adopted.
arXiv Detail & Related papers (2022-03-16T14:59:40Z)
- Target Model Agnostic Adversarial Attacks with Query Budgets on Language Understanding Models [14.738950386902518]
We propose a target model agnostic adversarial attack method with a high degree of attack transferability across the attacked models.
Our empirical studies show that our method generates highly transferable adversarial sentences under the restriction of limited query budgets.
arXiv Detail & Related papers (2021-06-13T17:18:19Z)
- Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z)
- Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little knowledge of the attacked models as possible.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)
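As noted in the "A Realistic Threat Model for Large Language Model Jailbreaks" entry above, the perplexity constraint on jailbreak prompts can be illustrated with a short, hedged sketch: score a candidate prompt with a small reference language model and reject it when it deviates too far from natural text. The reference model (GPT-2) and the threshold below are illustrative assumptions, not values taken from that paper.

```python
# Hedged sketch of a perplexity constraint on jailbreak prompts, in the spirit
# of "A Realistic Threat Model for Large Language Model Jailbreaks".
# The reference model and the threshold below are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
reference_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def perplexity(text: str) -> float:
    # Perplexity = exp(mean token negative log-likelihood) under the reference LM.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = reference_lm(ids, labels=ids).loss
    return float(torch.exp(loss))


def within_threat_model(prompt: str, max_ppl: float = 200.0) -> bool:
    # Reject candidate jailbreaks that deviate too far from natural text.
    return perplexity(prompt) <= max_ppl
```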