Stealth edits to large language models
- URL: http://arxiv.org/abs/2406.12670v2
- Date: Wed, 30 Oct 2024 10:12:24 GMT
- Title: Stealth edits to large language models
- Authors: Oliver J. Sutton, Qinghua Zhou, Wei Wang, Desmond J. Higham, Alexander N. Gorban, Alexander Bastounis, Ivan Y. Tyukin
- Abstract summary: We show that a single metric can be used to assess a model's editability.
We also reveal the vulnerability of language models to stealth attacks.
- Score: 76.53356051271014
- License:
- Abstract: We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. We also reveal the vulnerability of language models to stealth attacks: a small change to a model's weights which fixes its response to a single attacker-chosen prompt. Stealth attacks are computationally simple, do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. Extensive experimental results illustrate and support our methods and their theoretical underpinnings. Demos and source code are available at https://github.com/qinghua-zhou/stealth-edits.
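The abstract's notion of a stealth edit (a targeted weight change that pins a model's response on one trigger prompt while leaving other behaviour essentially unchanged) can be illustrated with a toy rank-one update to a single linear layer. The sketch below is a minimal illustration with assumed synthetic dimensions and features, not the authors' method (their implementation is in the linked repository); the point is that in a high-dimensional feature space random inputs are nearly orthogonal to the trigger direction, which is why such edits can be made highly selective and why a dimension-based metric is a natural predictor of editability.

```python
# Toy rank-one "stealth edit" of one linear layer -- an illustration only,
# not the method from the paper (see https://github.com/qinghua-zhou/stealth-edits).
import numpy as np

rng = np.random.default_rng(0)
d = 768                                        # hypothetical hidden size
W = rng.normal(scale=0.02, size=(d, d))        # original layer weight

f = rng.normal(size=d)                         # feature of the trigger prompt
f /= np.linalg.norm(f)                         # normalise so the edit is exact on f
target = rng.normal(size=d)                    # response the editor wants on the trigger

# Rank-one update: exact on f, and scaled by (f . x) for any other input x.
W_edited = W + np.outer(target - W @ f, f)

print(np.allclose(W_edited @ f, target))       # True: response on the trigger is pinned

x = rng.normal(size=d)
x /= np.linalg.norm(x)                         # a random, unrelated unit input
# In high dimension f . x is O(1/sqrt(d)), so the change on x is small
# relative to the magnitude of the injected response.
print(np.linalg.norm((W_edited - W) @ x) / np.linalg.norm(target))
```

The paper's jet-pack block achieves its selectivity using only standard network operations rather than a raw rank-one overwrite; the toy above conveys only the orthogonality intuition.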
Related papers
- Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been regarded as a challenging property to encode in neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
- Model Pairing Using Embedding Translation for Backdoor Attack Detection on Open-Set Classification Tasks [63.269788236474234]
We propose to use model pairs on open-set classification tasks for detecting backdoors.
We show that this score can be an indicator of the presence of a backdoor, even when the paired models have different architectures.
This technique allows backdoors to be detected in models designed for open-set classification tasks, a setting that has received little study in the literature.
arXiv Detail & Related papers (2024-02-28T21:29:16Z)
- Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks [51.51023951695014]
Existing model stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers.
This paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses.
In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries.
arXiv Detail & Related papers (2023-08-02T05:54:01Z)
- Trojan Model Detection Using Activation Optimization [15.032071953322594]
Training machine learning models can be very expensive or even unaffordable.
Pre-trained models can be infected with Trojan attacks.
We present a novel method for detecting Trojan models.
arXiv Detail & Related papers (2023-06-08T02:17:29Z)
- A Plot is Worth a Thousand Words: Model Information Stealing Attacks via Scientific Plots [14.998272283348152]
It is well known that an adversary can leverage a target ML model's output to steal the model's information.
We propose a new side channel for model information stealing attacks, i.e., models' scientific plots.
arXiv Detail & Related papers (2023-02-23T12:57:34Z)
- MOVE: Effective and Harmless Ownership Verification via Embedded External Features [109.19238806106426]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.
We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.
In particular, we develop our MOVE method under both white-box and black-box settings to provide comprehensive model protection.
arXiv Detail & Related papers (2022-08-04T02:22:29Z)
- MEGA: Model Stealing via Collaborative Generator-Substitute Networks [4.065949099860426]
Recent data-free model stealing methods are shown to be effective at extracting the knowledge of the target model without using real query examples.
We propose a data-free model stealing framework, MEGA, which is based on collaborative generator-substitute networks.
Our results show that the accuracy of our trained substitute model and the adversarial attack success rate over it can be up to 33% and 40% higher than state-of-the-art data-free black-box attacks.
arXiv Detail & Related papers (2022-01-31T09:34:28Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Target Model Agnostic Adversarial Attacks with Query Budgets on Language Understanding Models [14.738950386902518]
We propose a target-model-agnostic adversarial attack method with a high degree of attack transferability across the attacked models.
Our empirical studies show that our method generates highly transferable adversarial sentences under the restriction of limited query budgets.
arXiv Detail & Related papers (2021-06-13T17:18:19Z)
- Query-free Black-box Adversarial Attacks on Graphs [37.88689315688314]
We propose a query-free black-box adversarial attack on graphs, in which the attacker has no knowledge of the target model and no query access to the model.
We prove that the impact of the flipped links on the target model can be quantified by spectral changes, and thus approximated using eigenvalue perturbation theory (the standard first-order result is recalled after this list).
Due to its simplicity and scalability, the proposed attack is not only generic across various graph-based models, but can also easily be extended to settings where different levels of knowledge are accessible.
arXiv Detail & Related papers (2020-12-12T08:52:56Z)
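For reference, the spectral argument summarised in the last entry rests on the standard first-order eigenvalue perturbation result, stated here for a symmetric adjacency matrix $A$ with eigenpair $(\lambda_i, u_i)$ (the paper's own bounds for flipped links are developed in the original work):

$$\lambda_i(A + \Delta A) \approx \lambda_i(A) + u_i^{\top}\, \Delta A\, u_i,$$

where $\Delta A$ is the symmetric change induced by flipping a small set of links; candidate flips can therefore be ranked by the size of this term without any query to the target model.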