Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI
- URL: http://arxiv.org/abs/2301.01874v1
- Date: Thu, 5 Jan 2023 02:12:47 GMT
- Title: Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI
- Authors: Lorena Piedras, Lucas Rosenblatt, Julia Wilkins
- Abstract summary: We focus on PERSPECTIVE from Jigsaw, a tool that promises to score the "toxicity" of text.
We propose a new benchmark, Selected Adversarial SemanticS, or SASS.
We find that PERSPECTIVE exhibits troubling shortcomings across a number of our toxicity categories.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting "toxic" language in internet content is a pressing social and
technical challenge. In this work, we focus on PERSPECTIVE from Jigsaw, a
state-of-the-art tool that promises to score the "toxicity" of text, with a
recent model update that claims impressive results (Lees et al., 2022). We seek
to challenge certain normative claims about toxic language by proposing a new
benchmark, Selected Adversarial SemanticS, or SASS. We evaluate PERSPECTIVE on
SASS, and compare to low-effort alternatives, like zero-shot and few-shot GPT-3
prompt models, in binary classification settings. We find that PERSPECTIVE
exhibits troubling shortcomings across a number of our toxicity categories.
SASS provides a new tool for evaluating performance on previously undetected
toxic language that avoids common normative pitfalls. Our work leads us to
emphasize the importance of questioning assumptions made by tools already in
deployment for toxicity detection in order to anticipate and prevent disparate
harms.
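The abstract compares PERSPECTIVE against zero-shot GPT-3 prompting in a binary classification setting. A minimal sketch of that setup is below; the prompt wording, label vocabulary, and parsing are assumptions for illustration, not the exact SASS protocol, and the model call itself is left abstract.

```python
def build_zero_shot_prompt(text: str) -> str:
    """Build a zero-shot binary toxicity prompt (wording is an
    assumption, not taken from the paper's actual prompts)."""
    return (
        "Decide whether the following comment is toxic.\n"
        f'Comment: "{text}"\n'
        "Answer with exactly one word, Toxic or NotToxic.\n"
        "Answer:"
    )

def parse_binary_label(completion: str) -> bool:
    """Map a model completion onto the binary toxic / not-toxic label."""
    return completion.strip().lower().startswith("toxic")

# Any text-completion model can stand in for GPT-3 here; the returned
# string is parsed into the binary label used for evaluation.
prompt = build_zero_shot_prompt("you are wonderful")
label = parse_binary_label("NotToxic")  # stand-in for a model output
```

A few-shot variant would simply prepend labeled example comments to the same prompt before the query.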
Related papers
- Towards Building a Robust Toxicity Predictor [13.162016701556725]
This paper presents a novel adversarial attack, ToxicTrap, which introduces small word-level perturbations that fool SOTA text classifiers into predicting toxic text samples as benign.
Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
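The idea of a word-level perturbation attack can be sketched with a toy example; the keyword classifier and the single character-swap perturbation below are illustrative stand-ins, not ToxicTrap's actual goal functions or transformations.

```python
def toy_classifier(text: str) -> bool:
    """Stand-in toxicity detector: flags a fixed keyword list.
    (A toy substitute for the SOTA classifiers such attacks target.)"""
    return any(w in text.lower().split() for w in {"idiot", "stupid"})

def perturb_word(word: str) -> str:
    """One small word-level perturbation: swap a letter for a look-alike."""
    return word.replace("i", "1", 1) if "i" in word else word + "."

def word_level_attack(text: str, classifier) -> str:
    """Greedily perturb one word at a time until the classifier
    flips its prediction from toxic to benign."""
    words = text.split()
    for i in range(len(words)):
        if not classifier(" ".join(words)):
            break  # already predicted benign; attack succeeded
        candidate = words.copy()
        candidate[i] = perturb_word(words[i])
        if not classifier(" ".join(candidate)):
            words = candidate  # keep the perturbation that fooled it
    return " ".join(words)
```

Real attacks search over many candidate transformations under semantic-similarity constraints; this sketch only shows the flip-the-label objective.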
arXiv Detail & Related papers (2024-04-09T22:56:05Z)
- DPP-Based Adversarial Prompt Searching for Language Models [56.73828162194457]
Auto-regressive Selective Replacement Ascent (ASRA) is a discrete optimization algorithm that selects prompts based on both quality and similarity with determinantal point process (DPP)
Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content.
arXiv Detail & Related papers (2024-03-01T05:28:06Z)
- ToxiSpanSE: An Explainable Toxicity Detection in Code Review Comments [4.949881799107062]
ToxiSpanSE is the first tool to detect toxic spans in the Software Engineering (SE) domain.
Our model achieved the best score with 0.88 $F1$, 0.87 precision, and 0.93 recall for toxic class tokens.
arXiv Detail & Related papers (2023-07-07T04:55:11Z)
- Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models [32.960462266615096]
Large language models produce human-like text that drives a growing number of applications.
Recent literature and, increasingly, real world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful.
We outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks.
arXiv Detail & Related papers (2022-06-16T17:28:01Z)
- Toxicity Detection with Generative Prompt-based Inference [3.9741109244650823]
It is a long-known risk that language models (LMs), once trained on a corpus containing undesirable content, can manifest biases and toxicity.
In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering.
arXiv Detail & Related papers (2022-05-24T22:44:43Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
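For context on what is being evaluated throughout this list, here is a sketch of the request and response shape of the Perspective API's `comments:analyze` method. The field names follow the public API documentation as best recalled; consult the official docs before relying on them, and note that no network call is made here.

```python
# Endpoint used by the hosted Perspective API (key passed as a query parameter).
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def build_analyze_request(text: str, languages=("en",)) -> dict:
    """Request body for comments:analyze, asking for the TOXICITY
    attribute (field names assumed from the public API docs)."""
    return {
        "comment": {"text": text},
        "languages": list(languages),
        "requestedAttributes": {"TOXICITY": {}},
    }

def extract_toxicity(response: dict) -> float:
    """Pull the summary TOXICITY probability out of an analyze response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

The returned value is a probability-like score in [0, 1]; deployments typically threshold it to obtain the binary toxic/benign decisions that benchmarks like SASS probe.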
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels.
We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
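The context sensitivity idea can be made concrete as the change in a post's toxicity score when its parent context is prepended. The toy scorer below (an all-caps ratio) is purely illustrative, standing in for a real model such as PERSPECTIVE; the metric definition is an assumption inferred from the abstract.

```python
def context_sensitivity(post: str, context: str, score) -> float:
    """How much a post's toxicity score moves once its parent
    context is included; `score` is any text -> [0, 1] scorer."""
    return abs(score(f"{context}\n{post}") - score(post))

def toy_score(text: str) -> float:
    """Toy 'toxicity' proxy: fraction of letters that are upper-case.
    Illustrative only; a real system would use a trained classifier."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / max(len(letters), 1)
```

Posts with high context sensitivity are exactly those the task asks a system to flag before scoring them in isolation.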
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans [2.4737119633827174]
In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms.
Social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content.
arXiv Detail & Related papers (2021-04-09T22:52:26Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
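One common way such degeneration is quantified, an "expected maximum toxicity" style metric, can be sketched as follows: sample several continuations per prompt, score each, take the per-prompt maximum, and average over prompts. The function below assumes scores are precomputed; whether this matches the paper's exact evaluation is not confirmed by the summary above.

```python
def expected_max_toxicity(scores_per_prompt) -> float:
    """Average, over prompts, of the worst (maximum) toxicity score
    among that prompt's sampled continuations."""
    maxima = [max(scores) for scores in scores_per_prompt]
    return sum(maxima) / len(maxima)

# Two prompts, two sampled continuations each (scores are made up):
metric = expected_max_toxicity([[0.1, 0.9], [0.2, 0.4]])
```

Taking the maximum rather than the mean captures the risk that even one toxic sample among many is enough to cause harm in deployment.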
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.