DPP-Based Adversarial Prompt Searching for Language Models
- URL: http://arxiv.org/abs/2403.00292v1
- Date: Fri, 1 Mar 2024 05:28:06 GMT
- Title: DPP-Based Adversarial Prompt Searching for Language Models
- Authors: Xu Zhang and Xiaojun Wan
- Abstract summary: Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity, using a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
- Score: 56.73828162194457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models risk generating mindless and offensive content, which hinders
their safe deployment. Therefore, it is crucial to discover and modify
potential toxic outputs of pre-trained language models before deployment. In
this work, we elicit toxic content by automatically searching for a prompt that
directs pre-trained language models towards the generation of a specific target
output. The problem is challenging due to the discrete nature of textual data
and the considerable computational resources required for a single forward pass
of the language model. To combat these challenges, we introduce Auto-regressive
Selective Replacement Ascent (ASRA), a discrete optimization algorithm that
selects prompts based on both quality and similarity with determinantal point
process (DPP). Experimental results on six different pre-trained language
models demonstrate the efficacy of ASRA for eliciting toxic content.
Furthermore, our analysis reveals a strong correlation between the success rate
of ASRA attacks and the perplexity of target outputs, while indicating limited
association with the quantity of model parameters.
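The quality-versus-diversity selection that the abstract attributes to the DPP can be illustrated with a small greedy MAP sketch. This is not the paper's implementation: the `quality` scores, `similarity` kernel, and greedy selection loop below are illustrative assumptions, showing only why a DPP prefers a diverse set of high-quality candidates.

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting
    (pure Python; fine for the small matrices used here)."""
    n = len(m)
    a = [row[:] for row in m]
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[p][i]) < 1e-12:
            return 0.0
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d


def dpp_greedy_select(quality, similarity, k):
    """Greedy MAP selection under an L-ensemble DPP, where
    L[i][j] = quality[i] * similarity[i][j] * quality[j]:
    subsets of high-quality but mutually dissimilar items
    maximize the determinant of the selected submatrix."""
    n = len(quality)
    L = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
         for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            sub = chosen + [i]
            score = det([[L[r][c] for c in sub] for r in sub])
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Three candidate prompts: 0 and 1 are high quality but near-duplicates,
# 2 is weaker but dissimilar; the DPP picks 0 then 2 for diversity.
quality = [1.0, 0.9, 0.5]
similarity = [[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
print(dpp_greedy_select(quality, similarity, 2))  # [0, 2]
```

Note how the lower-quality candidate 2 beats the near-duplicate candidate 1: the determinant rewards diversity as well as quality, which is the trade-off the abstract describes.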
Related papers
- A linguistic analysis of undesirable outcomes in the era of generative AI [4.841442157674423]
We present a comprehensive simulation framework built upon the chat version of LLama2, focusing on the linguistic aspects of the generated content.
Our results show that the model produces less lexically rich content across generations, reducing diversity.
We find that autophagy transforms the initial model into a more creative, doubtful and confused one, which might provide inaccurate answers.
arXiv Detail & Related papers (2024-10-16T08:02:48Z)
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
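The margin-based steering that the SASA summary describes can be sketched as a single decoding step. This is a simplified illustration, not the paper's exact algorithm: the linear toxic-vs-non-toxic classifier `(w, b)`, the `beta` strength, and the per-token representations are all assumptions made for the example.

```python
import math

def margin_adjusted_sampling(logits, token_vecs, w, b, beta=5.0):
    """Toy margin-steered decoding step (illustrative, not SASA itself):
    each candidate token's logit is shifted by its signed margin under a
    linear toxic-vs-non-toxic classifier (w, b) over token representations,
    so tokens on the toxic side of the boundary lose probability mass."""
    adjusted = []
    for logit, vec in zip(logits, token_vecs):
        margin = sum(wi * vi for wi, vi in zip(w, vec)) + b
        # penalize only tokens with a negative (toxic-side) margin
        adjusted.append(logit + beta * min(0.0, margin))
    m = max(adjusted)  # numerically stable softmax
    exps = [math.exp(a - m) for a in adjusted]
    z = sum(exps)
    return [e / z for e in exps]


# Two equally likely tokens; token 1 sits on the toxic side (margin -1)
# and is heavily down-weighted before sampling.
probs = margin_adjusted_sampling([1.0, 1.0], [[1.0], [-1.0]], w=[1.0], b=0.0)
print(probs)
```

Because only the sampling distribution is modified, such a scheme needs no retraining of the underlying model, which is what makes this family of approaches "lightweight."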
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
- Enhancing adversarial robustness in Natural Language Inference using explanations [41.46494686136601]
We cast the spotlight on the underexplored task of Natural Language Inference (NLI).
We validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation.
We research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models.
arXiv Detail & Related papers (2024-09-11T17:09:49Z)
- Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models [24.784439330058095]
This study investigates concerns related to neural models inadvertently retaining personal or sensitive data.
A novel approach is introduced to achieve precise and selective forgetting within language models.
Two innovative evaluation metrics are proposed: Sensitive Information Extraction Likelihood (S-EL) and Sensitive Information Memory Accuracy (S-MA).
arXiv Detail & Related papers (2024-02-08T16:50:01Z)
- A Generative Adversarial Attack for Multilingual Text Classifiers [10.993289209465129]
We propose an approach to fine-tune a multilingual paraphrase model with an adversarial objective.
The training objective incorporates a set of pre-trained models to ensure text quality and language consistency.
The experimental validation over two multilingual datasets and five languages has shown the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-01-16T10:14:27Z)
- AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Interpretable Models [1.8752655643513647]
XAI tools can increase models' vulnerability to extraction attacks, a concern when model owners prefer black-box access.
We propose a novel retraining (learning) based model extraction attack framework against interpretable models under black-box settings.
We show that AUTOLYCUS is highly effective, requiring significantly fewer queries compared to state-of-the-art attacks.
arXiv Detail & Related papers (2023-02-04T13:23:39Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.