DPP-Based Adversarial Prompt Searching for Language Models
- URL: http://arxiv.org/abs/2403.00292v1
- Date: Fri, 1 Mar 2024 05:28:06 GMT
- Title: DPP-Based Adversarial Prompt Searching for Language Models
- Authors: Xu Zhang and Xiaojun Wan
- Abstract summary: Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
- Score: 56.73828162194457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models risk generating mindless and offensive content, which hinders
their safe deployment. Therefore, it is crucial to discover and modify
potential toxic outputs of pre-trained language models before deployment. In
this work, we elicit toxic content by automatically searching for a prompt that
directs pre-trained language models towards the generation of a specific target
output. The problem is challenging due to the discrete nature of textual data
and the considerable computational resources required for a single forward pass
of the language model. To combat these challenges, we introduce Auto-regressive
Selective Replacement Ascent (ASRA), a discrete optimization algorithm that
selects prompts based on both quality and similarity with a determinantal point
process (DPP). Experimental results on six different pre-trained language
models demonstrate the efficacy of ASRA for eliciting toxic content.
Furthermore, our analysis reveals a strong correlation between the success rate
of ASRA attacks and the perplexity of target outputs, while indicating limited
association with the quantity of model parameters.
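
To illustrate what selecting candidates "based on both quality and similarity" with a DPP can look like, here is a minimal sketch. It is not the ASRA procedure from the paper: it assumes the common quality-diversity kernel L[i][j] = q_i * S_ij * q_j with cosine similarity for S, and uses a plain greedy search for a high-determinant subset. The function names, the toy quality scores, and the feature vectors are all hypothetical.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def greedy_dpp_select(quality, features, k):
    """Greedily pick up to k indices with a large det(L_S).

    Assumed kernel: L[i][j] = quality[i] * cosine(f_i, f_j) * quality[j].
    High-quality items raise the determinant; near-duplicates lower it.
    """
    n = len(quality)
    kernel = [
        [quality[i] * cosine(features[i], features[j]) * quality[j]
         for j in range(n)]
        for i in range(n)
    ]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no candidate adds any volume
            break
        selected.append(best)
    return selected


# Toy candidates: 0 and 1 are near-duplicates, 2 is weaker but different.
quality = [0.9, 0.85, 0.5]
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
print(greedy_dpp_select(quality, features, 2))  # prints [0, 2]
```

In this toy run the second-best candidate by quality is skipped because it is almost identical to the first pick, which is the trade-off a DPP is meant to capture.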
Related papers
- Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models [24.784439330058095]
This study investigates concerns related to neural models inadvertently retaining personal or sensitive data.
A novel approach is introduced to achieve precise and selective forgetting within language models.
Two innovative evaluation metrics are proposed: Sensitive Information Extraction Likelihood (S-EL) and Sensitive Information Memory Accuracy (S-MA).
arXiv Detail & Related papers (2024-02-08T16:50:01Z)
- A Generative Adversarial Attack for Multilingual Text Classifiers [10.993289209465129]
We propose an approach to fine-tune a multilingual paraphrase model with an adversarial objective.
The training objective incorporates a set of pre-trained models to ensure text quality and language consistency.
The experimental validation over two multilingual datasets and five languages has shown the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-01-16T10:14:27Z)
- A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors [9.279391026742658]
We analyze the effect of model size, training objectives, and model architecture on the models' performance as a feature extractor.
We develop a novel metric, the Phonetic-Syntax Ratio (PSR), to measure the phonetic and syntactic information in the extracted representations.
arXiv Detail & Related papers (2023-11-27T15:58:28Z)
- AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Interpretable Models [1.8752655643513647]
XAI tools can increase vulnerability to model extraction attacks, which is a concern when model owners prefer black-box access.
We propose a novel retraining (learning) based model extraction attack framework against interpretable models under black-box settings.
We show that AUTOLYCUS is highly effective, requiring significantly fewer queries compared to state-of-the-art attacks.
arXiv Detail & Related papers (2023-02-04T13:23:39Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- Text Generation by Learning from Demonstrations [17.549815256968877]
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation.
We propose GOLD: an easy-to-optimize algorithm that learns from expert demonstrations by importance weighting.
According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient.
arXiv Detail & Related papers (2020-09-16T17:58:37Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.