Adding Instructions during Pretraining: Effective Way of Controlling
Toxicity in Language Models
- URL: http://arxiv.org/abs/2302.07388v1
- Date: Tue, 14 Feb 2023 23:00:42 GMT
- Title: Adding Instructions during Pretraining: Effective Way of Controlling
Toxicity in Language Models
- Authors: Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
- Abstract summary: We propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility.
Our two strategies are: (1) MEDA, which adds the raw toxicity score as meta-data to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity.
Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks.
- Score: 29.505176809305095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained large language models have become indispensable for solving
various natural language processing (NLP) tasks. However, safely deploying them
in real world applications is challenging because they generate toxic content.
To address this challenge, we propose two novel pretraining data augmentation
strategies that significantly reduce model toxicity without compromising its
utility. Our two strategies are: (1) MEDA, which adds the raw toxicity score as
meta-data to the pretraining samples, and (2) INST, which adds instructions to
those samples indicating their toxicity. Our results indicate that our best
performing strategy (INST) substantially reduces the toxicity probability by up
to 61% while preserving accuracy on five benchmark NLP tasks and improving AUC
scores on four bias detection tasks by 1.3%. We also demonstrate the
generalizability of our techniques by scaling the number of training samples
and the number of model parameters.
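The following is a minimal sketch of what the two augmentation strategies could look like in practice; it is illustrative only, since the exact metadata token format, instruction wording, toxicity-score source, and 0.5 threshold are assumptions rather than details taken from the paper.

```python
# Illustrative sketch of MEDA and INST pretraining-data augmentation.
# The tag format, instruction text, and threshold are assumptions.

def augment_meda(sample: str, toxicity_score: float) -> str:
    """MEDA: prepend the raw toxicity score as meta-data to the sample."""
    return f"<|toxicity:{toxicity_score:.2f}|> {sample}"

def augment_inst(sample: str, toxicity_score: float, threshold: float = 0.5) -> str:
    """INST: prepend an instruction indicating whether the sample is toxic."""
    label = "toxic" if toxicity_score >= threshold else "non-toxic"
    return f"Generate {label} text. {sample}"

if __name__ == "__main__":
    doc = "A sample pretraining document."
    score = 0.07  # assumed to come from an external toxicity classifier
    print(augment_meda(doc, score))  # <|toxicity:0.07|> A sample pretraining document.
    print(augment_inst(doc, score))  # Generate non-toxic text. A sample pretraining document.
```

At inference time, the same mechanism would presumably allow steering generation toward less toxic output by prepending the corresponding non-toxic metadata tag or instruction to the prompt.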
Related papers
- The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z)
- Persistent Pre-Training Poisoning of LLMs [71.53046642099142]
Our work evaluates for the first time whether language models can also be compromised during pre-training.
We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary.
Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to persist through post-training.
arXiv Detail & Related papers (2024-10-17T16:27:13Z)
- TaeBench: Improving Quality of Toxic Adversarial Examples [10.768188905349874]
This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE).
We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE.
We show that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services.
arXiv Detail & Related papers (2024-10-08T00:14:27Z)
- DPP-Based Adversarial Prompt Searching for Lanugage Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with a determinantal point process (DPP).
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
- You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content [13.600755614321493]
We investigate how we can use large language models (LLMs) to tackle the problem of toxic content online.
We focus on three tasks: (1) Toxicity Classification, (2) Toxic Span Detection, and (3) Detoxification.
We find that prompt learning achieves around 10% improvement in the toxicity classification task compared to the baselines.
arXiv Detail & Related papers (2023-08-10T14:14:13Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model (a minimal toy sketch of this idea appears after this list).
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms, including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models [84.30718841659531]
We explore domain-adaptive training to reduce the toxicity of language models.
For the training corpus, we propose to leverage the generative power of LMs.
We then comprehensively study LMs with parameter sizes ranging from 126M up to 530B, a scale that has never been studied before.
arXiv Detail & Related papers (2022-02-08T22:10:40Z)
- UoB at SemEval-2021 Task 5: Extending Pre-Trained Language Models to Include Task and Domain-Specific Information for Toxic Span Prediction [0.8376091455761259]
Toxicity is pervasive in social media and poses a major threat to the health of online communities.
Recent introduction of pre-trained language models, which have achieved state-of-the-art results in many NLP tasks, has transformed the way in which we approach natural language processing.
arXiv Detail & Related papers (2021-10-07T18:29:06Z)
- UPB at SemEval-2021 Task 5: Virtual Adversarial Training for Toxic Spans Detection [0.7197592390105455]
SemEval-2021 Task 5, Toxic Spans Detection, is based on a novel annotation of a subset of the Jigsaw Unintended Bias dataset.
For this task, participants had to automatically detect character spans in short comments that render the message as toxic.
Our model considers applying Virtual Adversarial Training in a semi-supervised setting during the fine-tuning process of several Transformer-based models.
arXiv Detail & Related papers (2021-04-17T19:42:12Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
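For the noise stability regularization idea summarized in the LNSR entry above, here is a minimal toy example under strong assumptions: a small linear encoder and head stand in for a pretrained model, the stability penalty is an MSE between clean and noise-perturbed logits, and the noise_std and reg_weight values are arbitrary. It is a sketch of the general idea, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyClassifier(nn.Module):
    """Stand-in for a fine-tuned encoder plus classification head (illustrative only)."""
    def __init__(self, dim=16, num_classes=2):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, num_classes)

def lnsr_loss(model, x, y, noise_std=1.0, reg_weight=0.1):
    """Task loss plus a penalty on how much injected Gaussian noise shifts the output."""
    hidden = torch.tanh(model.encoder(x))                         # hidden representation
    logits_clean = model.head(hidden)
    task_loss = F.cross_entropy(logits_clean, y)

    noisy_hidden = hidden + noise_std * torch.randn_like(hidden)  # inject standard Gaussian noise
    logits_noisy = model.head(noisy_hidden)
    stability = F.mse_loss(logits_noisy, logits_clean.detach())   # regularize output stability

    return task_loss + reg_weight * stability

model = ToyClassifier()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = lnsr_loss(model, x, y)
loss.backward()
```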
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.