You Only Prompt Once: On the Capabilities of Prompt Learning on Large
Language Models to Tackle Toxic Content
- URL: http://arxiv.org/abs/2308.05596v1
- Date: Thu, 10 Aug 2023 14:14:13 GMT
- Authors: Xinlei He and Savvas Zannettou and Yun Shen and Yang Zhang
- Abstract summary: We investigate how we can use large language models (LLMs) to tackle the problem of toxic content online.
We focus on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection, and 3) Detoxification.
We find that prompt learning achieves around 10% improvement in the toxicity classification task compared to the baselines.
- Score: 13.600755614321493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The spread of toxic content online is an important problem that has adverse
effects on user experience online and in our society at large. Motivated by the
importance and impact of the problem, research focuses on developing solutions
to detect toxic content, usually leveraging machine learning (ML) models
trained on human-annotated datasets. While these efforts are important, these
models usually do not generalize well and they cannot cope with new trends
(e.g., the emergence of new toxic terms). Currently, we are witnessing a shift
in the approach to tackling societal issues online, particularly leveraging
large language models (LLMs) like GPT-3 or T5 that are trained on vast corpora
and have strong generalizability. In this work, we investigate how we can use
LLMs and prompt learning to tackle the problem of toxic content, particularly
focusing on three tasks: 1) Toxicity Classification, 2) Toxic Span Detection,
and 3) Detoxification. We perform an extensive evaluation over five model
architectures and eight datasets demonstrating that LLMs with prompt learning
can achieve similar or even better performance compared to models trained on
these specific tasks. We find that prompt learning achieves around 10%
improvement in the toxicity classification task compared to the baselines,
while for the toxic span detection task we find slightly better performance
than the best baseline (0.643 vs. 0.640 in terms of $F_1$-score). Finally, for the
detoxification task, we find that prompt learning can successfully reduce the
average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.
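The two headline numbers above can be made concrete with a short sketch. The abstract does not say how the $F_1$-score for toxic span detection is computed, so the character-offset formulation below is an assumption, and `span_f1` and `relative_reduction` are hypothetical helper names rather than anything from the paper.

```python
from typing import Set


def span_f1(predicted: Set[int], gold: Set[int]) -> float:
    """F1 over sets of toxic character offsets (assumed formulation)."""
    if not predicted and not gold:
        # Assumption: an empty prediction for a text with no toxic span
        # counts as perfect agreement.
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a score dropped relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy example: half of the predicted offsets are correct and half of
    # the gold offsets are recovered.
    print(span_f1({0, 1, 2, 3}, {2, 3, 4, 5}))  # 0.5

    # Average toxicity scores reported in the abstract for detoxification.
    print(round(relative_reduction(0.775, 0.213), 3))  # 0.725
```

Under this reading, the reported drop from 0.775 to 0.213 corresponds to a relative reduction of roughly 72.5% in the average toxicity score.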
Related papers
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
Date: 2024-06-24T15:16:45Z
- Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives.
The huge amount of data which endows them with vast and diverse knowledge exposes them to the inevitable toxicity and bias.
This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
Date: 2024-05-17T09:42:59Z
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with a limited impact on general performance.
Date: 2024-03-21T15:18:30Z
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
Date: 2024-03-15T14:36:38Z
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
Date: 2023-11-29T06:42:36Z
- Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models [29.505176809305095]
We propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility.
Our two strategies are: (1) MEDA: adds raw toxicity score as meta-data to the pretraining samples, and (2) INST: adds instructions to those samples indicating their toxicity.
Our results indicate that our best-performing strategy (INST) reduces the toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks.
Date: 2023-02-14T23:00:42Z
- Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models [84.30718841659531]
We explore domain-adaptive training to reduce the toxicity of language models.
For the training corpus, we propose to leverage the generative power of LMs.
We then comprehensively study LMs with parameter sizes ranging from 126M up to 530B, a scale that has never been studied before.
Date: 2022-02-08T22:10:40Z
- Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations [15.152559543181523]
This study is the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection.
We demonstrate that model-agnostic lexical substitutions significantly hurt performance.
Augmentations proposed in prior work on toxicity prove to be less effective.
Date: 2022-01-17T12:48:27Z
- ToxCCIn: Toxic Content Classification with Interpretability [16.153683223016973]
Explanations are important for tasks like offensive language or toxicity detection on social media.
We propose a technique to improve the interpretability of transformer models, based on a simple and powerful assumption.
We find this approach effective, and it can produce explanations that exceed the quality of those provided by Logistic Regression analysis.
Date: 2021-03-01T22:17:10Z
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
Date: 2020-09-24T03:17:19Z
This list is automatically generated from the titles and abstracts of the papers on this site.