Characteristics of Harmful Text: Towards Rigorous Benchmarking of
Language Models
- URL: http://arxiv.org/abs/2206.08325v2
- Date: Fri, 28 Oct 2022 17:55:58 GMT
- Title: Characteristics of Harmful Text: Towards Rigorous Benchmarking of
Language Models
- Authors: Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes
Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving,
Iason Gabriel, William Isaac, Lisa Anne Hendricks
- Abstract summary: Large language models produce human-like text that drive a growing number of applications.
Recent literature and, increasingly, real world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful.
We outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks.
- Score: 32.960462266615096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models produce human-like text that drive a growing number of
applications. However, recent literature and, increasingly, real world
observations, have demonstrated that these models can generate language that is
toxic, biased, untruthful or otherwise harmful. Though work to evaluate
language model harms is under way, translating foresight about which harms may
arise into rigorous benchmarks is not straightforward. To facilitate this
translation, we outline six ways of characterizing harmful text which merit
explicit consideration when designing new benchmarks. We then use these
characteristics as a lens to identify trends and gaps in existing benchmarks.
Finally, we apply them in a case study of the Perspective API, a toxicity
classifier that is widely used in harm benchmarks. Our characteristics provide
one piece of the bridge that translates between foresight and effective
evaluation.
Related papers
- Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large-language-models to generate negative text descriptions, and text-to-image diffusion models to also generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z) - Quantifying the Plausibility of Context Reliance in Neural Machine
Translation [25.29330352252055]
We introduce Plausibility Evaluation of Context Reliance (PECoRe)
PECoRe is an end-to-end interpretability framework designed to quantify context usage in language models' generations.
We use pecore to quantify the plausibility of context-aware machine translation models.
arXiv Detail & Related papers (2023-10-02T13:26:43Z) - A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - Mitigating harm in language models with conditional-likelihood
filtration [4.002298833349518]
We present a methodology for identifying harmful views from webscale unfiltered datasets.
We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text.
We also discuss how trigger phrases which specific values can be used by researchers to build language models which are more closely aligned with their values.
arXiv Detail & Related papers (2021-08-04T22:18:10Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z) - Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.