Mitigating harm in language models with conditional-likelihood
filtration
- URL: http://arxiv.org/abs/2108.07790v1
- Date: Wed, 4 Aug 2021 22:18:10 GMT
- Title: Mitigating harm in language models with conditional-likelihood
filtration
- Authors: Helen Ngo, Cooper Raterink, João G.M. Araújo, Ivan Zhang, Carol
Chen, Adrien Morisot, Nicholas Frosst
- Abstract summary: We present a methodology for identifying harmful views in web-scale unfiltered datasets.
We demonstrate that models trained on this filtered dataset exhibit a lower propensity to generate harmful text.
We also discuss how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.
- Score: 4.002298833349518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models trained on large-scale unfiltered datasets curated from the
open web acquire systemic biases, prejudices, and harmful views from their
training data. We present a methodology for programmatically identifying and
removing harmful text from web-scale datasets. A pretrained language model is
used to calculate the log-likelihood of researcher-written trigger phrases
conditioned on a specific document, which is used to identify and filter
documents from the dataset. We demonstrate that models trained on this filtered
dataset exhibit lower propensity to generate harmful text, with a marginal
decrease in performance on standard language modeling benchmarks compared to
unfiltered baselines. We provide a partial explanation for this performance gap
by surfacing examples of hate speech and other undesirable content from
standard language modeling benchmarks. Finally, we discuss the generalization
of this method and how trigger phrases which reflect specific values can be
used by researchers to build language models which are more closely aligned
with their values.
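The filtering step the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scorer interface, the max-over-phrases aggregation, and the threshold value are all assumptions made for the example, and the toy word-overlap scorer stands in for a real pretrained language model.

```python
from typing import Callable, List

def filter_documents(
    documents: List[str],
    trigger_phrases: List[str],
    cond_log_likelihood: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep only documents for which every trigger phrase is unlikely.

    cond_log_likelihood(phrase, document) should return the log-likelihood
    of the phrase conditioned on the document, as scored by a pretrained
    language model; the scorer is supplied by the caller.
    """
    kept = []
    for doc in documents:
        # If any trigger phrase is likely given this document, drop it.
        max_ll = max(cond_log_likelihood(p, doc) for p in trigger_phrases)
        if max_ll < threshold:
            kept.append(doc)
    return kept

# Hypothetical stand-in scorer for demonstration only: phrases sharing
# words with the document get a higher (less negative) log-likelihood.
def toy_scorer(phrase: str, document: str) -> float:
    overlap = len(set(phrase.lower().split()) & set(document.lower().split()))
    return -5.0 + overlap  # illustrative scale, not real model output

docs = ["a friendly cooking blog post", "a post full of hateful slurs"]
triggers = ["hateful slurs"]
print(filter_documents(docs, triggers, toy_scorer, threshold=-4.0))
# keeps only the first document
```

In practice the scorer would come from a real causal language model (conditioning on the document and scoring the trigger phrase's tokens), and the threshold would be tuned against held-out data.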
Related papers
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to generate corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z)
- LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction [21.553915781660905]
LatestEval is an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations.
It avoids data contamination by only using texts published within a recent time window, ensuring no overlap with the training corpora of pre-trained language models.
Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval as opposed to previous benchmarks.
arXiv Detail & Related papers (2023-12-19T17:16:43Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of retrieval-augmented language models (RALMs) is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Recovering from Privacy-Preserving Masking with Large Language Models [14.828717714653779]
We use large language models (LLMs) to suggest substitutes of masked tokens.
We show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data.
arXiv Detail & Related papers (2023-09-12T16:39:41Z)
- Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models [32.960462266615096]
Large language models produce human-like text that drives a growing number of applications.
Recent literature and, increasingly, real world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful.
We outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks.
arXiv Detail & Related papers (2022-06-16T17:28:01Z)
- Coarse-to-Fine Memory Matching for Joint Retrieval and Classification [0.7081604594416339]
We present a novel end-to-end language model for joint retrieval and classification.
We evaluate it on the standard blind test set of the FEVER fact verification dataset.
We extend exemplar auditing to this setting for analyzing and constraining the model.
arXiv Detail & Related papers (2020-11-29T05:06:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.