Challenges in Detoxifying Language Models
- URL: http://arxiv.org/abs/2109.07445v1
- Date: Wed, 15 Sep 2021 17:27:06 GMT
- Title: Challenges in Detoxifying Language Models
- Authors: Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri,
John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben
Coppin, Po-Sen Huang
- Abstract summary: Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks.
Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world.
We evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation.
- Score: 44.48396735574315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LM) generate remarkably fluent text and can be
efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of
generated text in terms of safety is imperative for deploying LMs in the real
world; to this end, prior work often relies on automatic evaluation of LM
toxicity. We critically discuss this approach, evaluate several toxicity
mitigation strategies with respect to both automatic and human evaluation, and
analyze consequences of toxicity mitigation in terms of model bias and LM
quality. We demonstrate that while basic intervention strategies can
effectively optimize previously established automatic metrics on the
RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for
both texts about, and dialects of, marginalized groups. Additionally, we find
that human raters often disagree with high automatic toxicity scores after
strong toxicity reduction interventions -- highlighting further the nuances
involved in careful evaluation of LM toxicity.
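The previously established automatic metrics referred to here are those introduced with RealToxicityPrompts: expected maximum toxicity and toxicity probability, computed over k sampled continuations per prompt with an external toxicity classifier. Below is a minimal sketch of that computation, assuming a hypothetical `score_toxicity` helper wrapping a classifier such as the Perspective API; the function names and default threshold are illustrative, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of the automatic toxicity metrics
# popularized by RealToxicityPrompts: Expected Maximum Toxicity and Toxicity
# Probability over k sampled continuations per prompt. `score_toxicity` is an
# assumed helper wrapping an external classifier such as the Perspective API.
from statistics import mean

def automatic_toxicity_metrics(continuations_per_prompt, score_toxicity, threshold=0.5):
    """continuations_per_prompt: list of lists, k sampled continuations per prompt."""
    # Worst-case (maximum) toxicity score among the k continuations of each prompt.
    max_scores = [max(score_toxicity(c) for c in conts)
                  for conts in continuations_per_prompt]
    return {
        # Mean over prompts of the worst continuation's score.
        "expected_max_toxicity": mean(max_scores),
        # Fraction of prompts with at least one continuation above the threshold.
        "toxicity_probability": mean(s >= threshold for s in max_scores),
    }
```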
Related papers
- Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction in large language models (LLMs).
SASA tracks the margin of the current output and steers generation away from the toxic subspace by adjusting the autoregressive sampling strategy.
It is evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L, on the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
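As a rough illustration of this idea (not the SASA algorithm itself), margin-guided controlled decoding can be thought of as reweighting the next-token distribution so that candidates falling on the toxic side of a learned boundary are penalized; all names below are placeholders.

```python
# Rough sketch of margin-guided reweighting during decoding, in the spirit of
# controlled decoding methods like SASA (illustrative only; SASA's learned
# classifier and exact reweighting differ). `margins` stands in for signed scores
# from such a classifier: positive = non-toxic side, negative = toxic side.
import torch

def reweighted_next_token_distribution(logits, margins, beta=5.0):
    """logits, margins: 1-D tensors over the candidate vocabulary."""
    # Penalize candidates whose margin is on the toxic side, in proportion to
    # how far they fall on that side; non-toxic candidates are left unchanged.
    penalty = beta * torch.clamp(margins, max=0.0)
    return torch.softmax(logits + penalty, dim=-1)
```

In an actual decoding loop, the next token would be sampled from this adjusted distribution rather than from the raw softmax over the logits.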
- Realistic Evaluation of Toxicity in Large Language Models [28.580995165272086]
Large language models (LLMs) have become integral to our professional and daily lives.
The huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias.
This paper introduces the new Thoroughly Engineered Toxicity dataset, comprising manually crafted prompts.
arXiv Detail & Related papers (2024-05-17T09:42:59Z)
- PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models [27.996123856250065]
Existing toxicity benchmarks are overwhelmingly focused on English.
We introduce PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages.
arXiv Detail & Related papers (2024-05-15T14:22:33Z)
- Detoxifying Large Language Models via Knowledge Editing [57.0669577257301]
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs).
We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts.
Experiments with several knowledge editing approaches indicate that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance.
arXiv Detail & Related papers (2024-03-21T15:18:30Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to enable dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts [57.38912708076231]
We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods.
MaRCo uses likelihoods under a non-toxic LM and a toxic LM to find candidate words to mask and potentially replace.
We evaluate our method on several subtle toxicity and microaggression datasets and show that it not only outperforms baselines on automatic metrics, but that MaRCo's rewrites are also preferred 2.1 times more often in human evaluation.
arXiv Detail & Related papers (2022-12-20T18:50:00Z)
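A minimal sketch of the expert/anti-expert masking step described above, assuming per-token log-likelihoods from a non-toxic ("expert") LM and a toxic ("anti-expert") LM are already available; the helper name and threshold are illustrative, and the full method also rewrites the masked spans with controllable generation.

```python
# Minimal sketch of MaRCo-style mask selection (illustrative, not the paper's code):
# tokens that the toxic "anti-expert" LM likes much more than the non-toxic
# "expert" LM become candidates for masking and later rewriting.
def select_mask_candidates(tokens, logp_expert, logp_antiexpert, threshold=1.0):
    """tokens and the two per-token log-likelihood lists have equal length."""
    revised = []
    for tok, lp_good, lp_bad in zip(tokens, logp_expert, logp_antiexpert):
        # A large positive gap means the toxic LM finds this token far more likely.
        revised.append("[MASK]" if lp_bad - lp_good > threshold else tok)
    return revised
```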
- Detoxifying Language Models Risks Marginalizing Minority Voices [40.918564746367586]
Language models (LMs) must be both safe and equitable to be responsibly deployed in practice.
Detoxification techniques have been proposed to mitigate toxic LM generations.
We show that current detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups.
arXiv Detail & Related papers (2021-04-13T17:52:01Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)