Mitigating harm in language models with conditional-likelihood
filtration
- URL: http://arxiv.org/abs/2108.07790v1
- Date: Wed, 4 Aug 2021 22:18:10 GMT
- Title: Mitigating harm in language models with conditional-likelihood
filtration
- Authors: Helen Ngo, Cooper Raterink, João G.M. Araújo, Ivan Zhang, Carol
Chen, Adrien Morisot, Nicholas Frosst
- Abstract summary: We present a methodology for identifying and removing harmful text from web-scale unfiltered datasets.
We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text.
We also discuss how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
- Score: 4.002298833349518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models trained on large-scale unfiltered datasets curated from the
open web acquire systemic biases, prejudices, and harmful views from their
training data. We present a methodology for programmatically identifying and
removing harmful text from web-scale datasets. A pretrained language model is
used to calculate the log-likelihood of researcher-written trigger phrases
conditioned on a specific document, which is used to identify and filter
documents from the dataset. We demonstrate that models trained on this filtered
dataset exhibit lower propensity to generate harmful text, with a marginal
decrease in performance on standard language modeling benchmarks compared to
unfiltered baselines. We provide a partial explanation for this performance gap
by surfacing examples of hate speech and other undesirable content from
standard language modeling benchmarks. Finally, we discuss the generalization
of this method and how trigger phrases which reflect specific values can be
used by researchers to build language models which are more closely aligned
with their values.
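As a rough illustration of the filtering step described in the abstract, the sketch below scores a document by the log-likelihood a pretrained causal language model assigns to a researcher-written trigger phrase conditioned on that document, and keeps only documents that score below a threshold. The model, the example trigger phrase, and the threshold are illustrative placeholders, not the paper's actual choices.

```python
# Minimal sketch of conditional-likelihood filtration, assuming the Hugging Face
# transformers library. Model, trigger phrase, and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def trigger_log_likelihood(document: str, trigger: str) -> float:
    """Sum of log p(trigger token | document + preceding trigger tokens)."""
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    trig_ids = tokenizer(" " + trigger, return_tensors="pt").input_ids
    input_ids = torch.cat([doc_ids, trig_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..L-1
    doc_len, total_len = doc_ids.shape[1], input_ids.shape[1]
    positions = range(doc_len - 1, total_len - 1)  # positions predicting trigger tokens
    targets = input_ids[0, doc_len:]
    return sum(log_probs[0, pos, tok].item() for pos, tok in zip(positions, targets))

TRIGGER = "example researcher-written trigger phrase"  # placeholder text
THRESHOLD = -40.0                                       # hypothetical cutoff

def keep_document(document: str) -> bool:
    # Documents under which the trigger is unusually likely are filtered out.
    return trigger_log_likelihood(document, TRIGGER) < THRESHOLD
```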
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, a large performance drop of over 25% is observed on redacted queries compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language [0.0]
This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language.
We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and large language model (LLM) annotations.
arXiv Detail & Related papers (2024-10-17T08:10:24Z)
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data (a rough sketch of this scoring follows this entry).
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to capture semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
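To illustrate the perplexity-difference idea behind ScalingFilter, here is a rough sketch assuming the Hugging Face transformers library; the model pair and the exact score definition are assumptions for illustration, not the paper's configuration.

```python
# Rough sketch of perplexity-difference quality scoring in the spirit of
# ScalingFilter. Model pair and score definition are illustrative assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return math.exp(loss.item())

small_tok = AutoTokenizer.from_pretrained("gpt2")
small_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
large_tok = AutoTokenizer.from_pretrained("gpt2-medium")
large_lm = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

def quality_score(text: str) -> float:
    """Higher when the larger model fits the text much better than the smaller one."""
    return perplexity(small_lm, small_tok, text) - perplexity(large_lm, large_tok, text)
```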
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction [21.553915781660905]
LatestEval is an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations.
It avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models.
Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks.
arXiv Detail & Related papers (2023-12-19T17:16:43Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information should help model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models (a minimal sketch of such a projection follows this entry).
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
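To illustrate the core operation behind the debiasing approach above, here is a minimal sketch of projecting bias directions out of a text embedding; the calibrated projection described in the paper is not reproduced, and the dimensions and vectors are placeholder values.

```python
# Minimal sketch of removing biased directions from a text embedding by
# orthogonal projection. The paper's calibration step is omitted and all
# embeddings here are random placeholders.
import numpy as np

def debias_projection(bias_directions: np.ndarray) -> np.ndarray:
    """P = I - V V^T, where V's columns span the bias directions;
    P maps embeddings onto the orthogonal complement of that span."""
    v, _ = np.linalg.qr(bias_directions.T)            # (d, k) orthonormal basis
    return np.eye(bias_directions.shape[1]) - v @ v.T

rng = np.random.default_rng(0)
bias_dirs = rng.normal(size=(2, 8))   # stand-ins for embeddings of biased prompts
text_emb = rng.normal(size=(8,))      # stand-in for a text embedding

P = debias_projection(bias_dirs)
debiased = P @ text_emb
# The debiased embedding has no component along the bias directions.
assert np.allclose(bias_dirs @ debiased, 0.0, atol=1e-8)
```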
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models [32.960462266615096]
Large language models produce human-like text that drives a growing number of applications.
Recent literature and, increasingly, real world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful.
We outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks.
arXiv Detail & Related papers (2022-06-16T17:28:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.