You Only Prompt Once: On the Capabilities of Prompt Learning on Large
Language Models to Tackle Toxic Content
- URL: http://arxiv.org/abs/2308.05596v1
- Date: Thu, 10 Aug 2023 14:14:13 GMT
- Title: You Only Prompt Once: On the Capabilities of Prompt Learning on Large
Language Models to Tackle Toxic Content
- Authors: Xinlei He and Savvas Zannettou and Yun Shen and Yang Zhang
- Abstract summary: We investigate how we can use large language models (LLMs) to tackle the problem of toxic content online.
We focus on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection, and 3) Detoxification.
We find that prompt learning achieves around 10% improvement in the toxicity classification task compared to the baselines.
- Score: 13.600755614321493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The spread of toxic content online is an important problem that has adverse
effects on user experience online and in our society at large. Motivated by the
importance and impact of the problem, research focuses on developing solutions
to detect toxic content, usually leveraging machine learning (ML) models
trained on human-annotated datasets. While these efforts are important, these
models usually do not generalize well and cannot cope with new trends
(e.g., the emergence of new toxic terms). Currently, we are witnessing a shift
in the approach to tackling societal issues online, particularly leveraging
large language models (LLMs) like GPT-3 or T5 that are trained on vast corpora
and have strong generalizability. In this work, we investigate how we can use
LLMs and prompt learning to tackle the problem of toxic content, particularly
focusing on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection,
and 3) Detoxification. We perform an extensive evaluation over five model
architectures and eight datasets, demonstrating that LLMs with prompt learning
can achieve similar or even better performance compared to models trained on
these specific tasks. We find that prompt learning achieves around a 10%
improvement in the toxicity classification task compared to the baselines,
while for the toxic span detection task it performs slightly better than the
best baseline (0.643 vs. 0.640 in terms of $F_1$-score). Finally, for the
detoxification task, we find that prompt learning can successfully reduce the
average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.
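To make the prompt-based setup concrete, below is a minimal sketch of the toxicity classification task treated as simple zero-shot prompting. The paper studies prompt learning more broadly; the model choice (google/flan-t5-base), prompt wording, and label parsing here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal zero-shot prompting sketch for toxicity classification.
# Model, prompt template, and label parsing are assumptions for illustration.
from transformers import pipeline

classifier = pipeline("text2text-generation", model="google/flan-t5-base")

def classify_toxicity(text: str) -> str:
    prompt = (
        "Decide whether the following comment is toxic or non-toxic.\n"
        f"Comment: {text}\n"
        "Answer with exactly one word, toxic or non-toxic:"
    )
    output = classifier(prompt, max_new_tokens=5)[0]["generated_text"].strip().lower()
    # Map the free-form generation back onto the two labels.
    if output.startswith("non"):
        return "non-toxic"
    return "toxic" if "toxic" in output else "non-toxic"

print(classify_toxicity("Have a great day, friend!"))
```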
Related papers
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune the subwords of toxic words from the BPE vocabulary of trained LLMs.
ToxPrune also noticeably improves the toxic language model NSFW-3B on the task of dialogue response generation.
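A rough sketch of the idea follows, under the assumption that blocking the BPE token sequences of a small toxic-word list at decoding time (via the standard `bad_words_ids` generation argument) approximates pruning those subwords from the vocabulary; the word list and model are placeholders, not the paper's setup.

```python
# Approximate the ToxPrune idea by suppressing the BPE subword sequences of
# listed toxic words during generation. Word list and model are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

toxic_words = ["idiot", "stupid"]  # illustrative list, not from the paper
bad_words_ids = [
    tokenizer(w, add_special_tokens=False).input_ids for w in toxic_words
] + [
    tokenizer(" " + w, add_special_tokens=False).input_ids for w in toxic_words
]

inputs = tokenizer("That reply was", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```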
arXiv Detail & Related papers (2024-10-05T13:30:33Z) - Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs).
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
SASA is evaluated on LLMs of different scales and natures, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, on the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
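The controlled-decoding mechanism can be illustrated with a heavily simplified sketch: re-weight the top-k next-token candidates by a linear "margin" computed on the model's own hidden states. The random classifier weights below stand in for the subspace SASA actually learns, so this shows the sampling loop only, not the paper's algorithm.

```python
# Simplified margin-guided decoding loop in the spirit of SASA.
# The classifier direction (w, b) is random here; SASA learns it from data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

hidden_size = model.config.n_embd
w = torch.randn(hidden_size)  # placeholder non-toxic direction
b = torch.zeros(())           # placeholder bias
beta = 5.0                    # strength of the steering term

@torch.no_grad()
def margin(ids: torch.Tensor) -> torch.Tensor:
    # Margin of the last hidden state w.r.t. the (placeholder) decision boundary.
    h = model(ids, output_hidden_states=True).hidden_states[-1][0, -1]
    return w @ h + b

@torch.no_grad()
def guided_step(ids: torch.Tensor, k: int = 10) -> torch.Tensor:
    logits = model(ids).logits[0, -1]
    topk = torch.topk(logits, k)
    # Score each candidate continuation and blend with the LM logits.
    scores = torch.stack([margin(torch.cat([ids, t.view(1, 1)], dim=1)) for t in topk.indices])
    adjusted = torch.softmax(topk.values + beta * scores, dim=-1)
    choice = topk.indices[torch.multinomial(adjusted, 1)]
    return torch.cat([ids, choice.view(1, 1)], dim=1)

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
for _ in range(10):
    ids = guided_step(ids)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```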
arXiv Detail & Related papers (2024-10-04T17:45:15Z) - Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives.
The huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias.
This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
arXiv Detail & Related papers (2024-05-17T09:42:59Z) - TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z) - Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect with simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z) - Adding Instructions during Pretraining: Effective Way of Controlling
Toxicity in Language Models [29.505176809305095]
We propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility.
Our two strategies are: (1) MEDA, which adds the raw toxicity score as metadata to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity.
Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability by up to 61% while preserving the accuracy on five benchmark NLP tasks.
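A toy illustration of the two augmentation strategies is given below, with a made-up tag format and instruction wording; the paper defines its own formats and uses real toxicity scores rather than this placeholder threshold.

```python
# Illustrative MEDA / INST style augmentations; formats are assumptions.
def meda(sample: str, toxicity_score: float) -> str:
    # MEDA: prepend the raw toxicity score as metadata.
    return f"<toxicity={toxicity_score:.2f}> {sample}"

def inst(sample: str, toxicity_score: float, threshold: float = 0.5) -> str:
    # INST: prepend a natural-language instruction describing the toxicity.
    label = "toxic" if toxicity_score >= threshold else "non-toxic"
    return f"The following text is {label}: {sample}"

print(meda("Have a nice day!", 0.02))
print(inst("Have a nice day!", 0.02))
```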
arXiv Detail & Related papers (2023-02-14T23:00:42Z) - Exploring the Limits of Domain-Adaptive Training for Detoxifying
Large-Scale Language Models [84.30718841659531]
We explore domain-adaptive training to reduce the toxicity of language models.
For the training corpus, we propose to leverage the generative power of LMs.
We then comprehensively study LMs with parameter sizes ranging from 126M up to 530B, a scale that has never been studied before.
arXiv Detail & Related papers (2022-02-08T22:10:40Z) - Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations [15.152559543181523]
This study is the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection.
We demonstrate that model-agnostic lexical substitutions significantly hurt performance.
Augmentations proposed in prior work on toxicity prove to be less effective.
arXiv Detail & Related papers (2022-01-17T12:48:27Z) - ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective, and it can produce explanations that exceed the quality of those provided by Logistic Regression analysis.
arXiv Detail & Related papers (2021-03-01T22:17:10Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.