Evaluating the Susceptibility of Pre-Trained Language Models via
Handcrafted Adversarial Examples
- URL: http://arxiv.org/abs/2209.02128v1
- Date: Mon, 5 Sep 2022 20:29:17 GMT
- Title: Evaluating the Susceptibility of Pre-Trained Language Models via
Handcrafted Adversarial Examples
- Authors: Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla
Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, Ramesh
Darwishi
- Abstract summary: We highlight a major security vulnerability in the public release of GPT-3 and investigate this vulnerability in other state-of-the-art PLMs.
We underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in the development of large language models have resulted in
public access to state-of-the-art pre-trained language models (PLMs), including
Generative Pre-trained Transformer 3 (GPT-3) and Bidirectional Encoder
Representations from Transformers (BERT). However, evaluations of PLMs, in
practice, have shown their susceptibility to adversarial attacks during the
training and fine-tuning stages of development. Such attacks can result in
erroneous outputs, model-generated hate speech, and the exposure of users'
sensitive information. While existing research has focused on adversarial
attacks during either the training or the fine-tuning of PLMs, there is a
deficit of information on attacks made between these two development phases. In
this work, we highlight a major security vulnerability in the public release of
GPT-3 and further investigate this vulnerability in other state-of-the-art
PLMs. We restrict our work to pre-trained models that have not undergone
fine-tuning. Further, we underscore token distance-minimized perturbations as
an effective adversarial approach, bypassing both supervised and unsupervised
quality measures. Following this approach, we observe a significant decrease in
text classification quality when evaluating for semantic similarity.
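The token distance-minimized perturbations described above can be illustrated with a short, hedged Python sketch. Everything here is a stand-in rather than the paper's actual pipeline: `classify` is a hypothetical placeholder for a PLM-backed classifier, and adjacent-character swaps are only one example of an edit that keeps the perturbed prompt as close as possible to the original.

```python
# Minimal sketch (not the paper's code): craft a perturbation that stays as
# close as possible to the original prompt in edit distance, then compare the
# model's outputs on the clean and perturbed inputs.
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def candidate_perturbations(text: str):
    """Yield small, distance-minimized edits: single adjacent-character swaps."""
    for i in range(len(text) - 1):
        yield text[:i] + text[i + 1] + text[i] + text[i + 2:]

def classify(text: str) -> str:
    """Hypothetical stand-in for a PLM-backed classifier."""
    return "positive" if "good" in text.lower() else "negative"

prompt = "The product works good and I would buy it again."
clean_label = classify(prompt)

# Pick the perturbation that flips the prediction while staying maximally
# similar to the original text (i.e., minimizing the edit distance).
best = max(
    (p for p in candidate_perturbations(prompt) if classify(p) != clean_label),
    key=lambda p: edit_similarity(prompt, p),
    default=None,
)
if best is not None:
    print(f"clean: {clean_label!r}  adversarial: {classify(best)!r}")
    print(f"similarity to original: {edit_similarity(prompt, best):.3f}")
    print(f"perturbed prompt: {best!r}")
```

In the paper's setting, the clean and perturbed outputs of a pre-trained model would instead be compared with a semantic-similarity measure; the sketch only shows the shape of the search for a minimal, prediction-flipping edit.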
Related papers
- Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers [95.22517830759193]
This paper studies the transferability of adversarial vulnerabilities from a pre-trained ViT model to downstream tasks.
We show that the proposed Downstream Transfer Attack (DTA) achieves an average attack success rate (ASR) exceeding 90%, surpassing existing methods by a large margin.
arXiv Detail & Related papers (2024-08-03T08:07:03Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks [90.52808174102157]
In safety-critical applications such as medical imaging and autonomous driving, it is imperative to maintain both high adversarial robustness against potential attacks and reliable uncertainty quantification in model predictions.
A notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models.
This study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks.
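As a generic, hedged sketch of the split conformal prediction procedure this entry refers to (not the paper's models, datasets, or attacks), the Python below calibrates a nonconformity threshold on toy softmax outputs and shows how an adversarially flattened output can inflate the prediction set, which is the kind of uncertainty behavior under attack the study examines.

```python
# Minimal split-conformal sketch (a generic illustration, not this paper's
# experimental setup): calibrate a score threshold on held-out data, then build
# prediction sets whose size can be compared on clean vs. attacked inputs.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Quantile of nonconformity scores (1 - probability of the true class)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, q, method="higher"))

def prediction_set(probs, qhat):
    """All classes whose nonconformity score stays below the threshold."""
    return np.where(1.0 - probs <= qhat)[0]

rng = np.random.default_rng(0)
# Toy softmax outputs for a 3-class problem; class 0 is always the true label.
cal_probs = rng.dirichlet([3, 1, 1], size=500)
cal_labels = np.zeros(500, dtype=int)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

clean = np.array([0.80, 0.15, 0.05])      # confident clean prediction
attacked = np.array([0.40, 0.35, 0.25])   # flattened by an adversarial input
print("clean prediction set:   ", prediction_set(clean, qhat))
print("attacked prediction set:", prediction_set(attacked, qhat))  # typically larger
```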
arXiv Detail & Related papers (2024-05-14T18:05:19Z)
- Semantic Stealth: Adversarial Text Attacks on NLP Using Several Methods [0.0]
A text adversarial attack involves the deliberate manipulation of input text to mislead a model's predictions.
Attacks against BERT, including the BERT-on-BERT attack, PWWS, and Fraud Bargain's Attack (FBA), are explored in this paper.
PWWS emerges as the most potent adversary, consistently outperforming the other methods across multiple evaluation scenarios.
arXiv Detail & Related papers (2024-04-08T02:55:01Z)
- Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection.
We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness.
The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
- Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples [28.947545367473086]
We propose Gen-AF, a two-stage adversarial fine-tuning approach aimed at enhancing the robustness of downstream models.
Our experiments, conducted across ten self-supervised training methods and six datasets, demonstrate that Gen-AF attains high testing accuracy and robust testing accuracy against state-of-the-art DAEs.
arXiv Detail & Related papers (2024-03-16T04:23:46Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Text generation for dataset augmentation in security classification tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- COVER: A Heuristic Greedy Adversarial Attack on Prompt-based Learning in Language Models [4.776465250559034]
We propose a prompt-based adversarial attack on manual templates in black-box scenarios.
First, we design character-level and word-level approaches to break manual templates separately.
We then present a greedy algorithm for the attack based on these destructive approaches.
arXiv Detail & Related papers (2023-06-09T03:53:42Z)
- Consistent Valid Physically-Realizable Adversarial Attack against Crowd-flow Prediction Models [4.286570387250455]
Deep learning (DL) models can effectively learn city-wide crowd-flow patterns.
However, DL models are known to perform poorly under inconspicuous adversarial perturbations.
arXiv Detail & Related papers (2023-03-05T13:30:25Z)
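To make the claim in the crowd-flow entry above concrete, here is a hedged toy illustration (not the paper's models or data): a sign-of-gradient perturbation, sized just large enough to cross the decision boundary of a linear classifier, flips the prediction while changing each feature only slightly.

```python
# Generic illustration (not the crowd-flow paper's method): a sign-of-gradient
# perturbation, just large enough to cross the decision boundary, flips a toy
# linear classifier while barely changing the input.
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=20)                 # weights of a toy linear classifier
b = 0.0
x = rng.normal(size=20)                 # a clean input

def predict(v: np.ndarray) -> int:
    return int(v @ w + b > 0)

y = predict(x)
# For a linear model the gradient of the logit w.r.t. the input is simply w;
# step against the currently predicted class with the smallest crossing size.
direction = -1.0 if y == 1 else 1.0
eps = (abs(x @ w + b) + 1e-3) / np.sum(np.abs(w))
x_adv = x + eps * direction * np.sign(w)

print("clean prediction:      ", y)
print("adversarial prediction:", predict(x_adv))
print("max per-feature change:", float(np.max(np.abs(x_adv - x))))
```

For the deep crowd-flow predictors discussed in the paper, the same idea applies with gradients obtained by backpropagation, subject to the physical-realizability constraints the authors study.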
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.