Perplexed by Quality: A Perplexity-based Method for Adult and Harmful
Content Detection in Multilingual Heterogeneous Web Data
- URL: http://arxiv.org/abs/2212.10440v1
- Date: Tue, 20 Dec 2022 17:14:45 GMT
- Title: Perplexed by Quality: A Perplexity-based Method for Adult and Harmful
Content Detection in Multilingual Heterogeneous Web Data
- Authors: Tim Jansen, Yangling Tong, Victoria Zevallos, Pedro Ortiz Suarez
- Abstract summary: We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely on adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach virtually clusters our documents into two distinct groups, which greatly facilitates the choice of the perplexity threshold.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As demand for large corpora increases with the size of current
state-of-the-art language models, using web data as the main part of the
pre-training corpus for these models has become a ubiquitous practice. This, in
turn, has introduced an important challenge for NLP practitioners, as they are
now confronted with the task of developing highly optimized models and
pipelines for pre-processing large quantities of textual data, which implies
effectively classifying and filtering multilingual, heterogeneous, and noisy
data at web scale. One of the main components of this pre-processing step for
the pre-training corpora of large language models is the removal of adult and
harmful content. In this paper we explore different methods for detecting adult
and harmful content in multilingual heterogeneous web data. We first show how
traditional methods for harmful content detection, which seemingly perform
quite well on small and specialized datasets, quickly break down when
confronted with heterogeneous, noisy web data. We then resort to a
perplexity-based approach, but with a twist: instead of training a small
language model on a so-called "clean" corpus and then using perplexity to
select the documents with low perplexity, i.e., the documents that most
resemble this "clean" corpus, we train solely on adult and harmful textual
data and then select the documents having a perplexity value above a given
threshold. This approach virtually clusters our documents into two distinct
groups, which greatly facilitates the choice of the perplexity threshold and
also allows us to obtain higher precision than with traditional classification
methods for detecting adult and harmful content.
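As a concrete illustration of the filtering step, here is a minimal sketch using the kenlm Python bindings. The model file, threshold value, and function names are hypothetical assumptions for illustration, not the paper's actual implementation; it only shows the core idea of scoring documents against an LM trained on harmful text and keeping the high-perplexity ones.

```python
# Minimal sketch: perplexity-based filtering with an n-gram LM trained
# ONLY on adult/harmful text. Uses the kenlm Python bindings; the model
# path "harmful_lm.arpa" and the threshold are hypothetical placeholders.
import kenlm

model = kenlm.Model("harmful_lm.arpa")  # hypothetical LM trained on harmful text

# Under a harmful-content LM, harmful documents receive LOW perplexity and
# ordinary documents receive HIGH perplexity, so high-perplexity documents
# are the ones to keep. The paper reports that the two groups separate
# cleanly, which makes the cutoff easy to pick from the perplexity
# distribution; the value below is a made-up placeholder.
THRESHOLD = 1000.0

def is_clean(document: str, threshold: float = THRESHOLD) -> bool:
    """Keep documents that do NOT resemble the adult/harmful training data."""
    return model.perplexity(document) > threshold

web_docs = [
    "an ordinary web-crawled document about cooking ...",
    "another noisy multilingual document ...",
]
clean_corpus = [d for d in web_docs if is_clean(d)]
```

Note the inversion relative to the usual quality-filtering setup: with a "clean"-corpus LM one keeps low-perplexity documents, whereas here, with an LM trained on harmful text, low perplexity flags the documents to discard.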
Related papers
- Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z)
- Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification.
We first obtain easy-to-learn examples for the target document classification task.
We then use the generated pseudo examples to perform curriculum learning.
arXiv Detail & Related papers (2023-12-01T11:34:37Z) - Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that filtering these datasets with CLIP similarity scores suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
- Curating corpora with classifiers: A case study of clean energy sentiment online [0.0]
Large-scale corpora of social media posts contain broad public opinion.
Surveys can be expensive to run and lag public opinion by days or weeks.
We propose a method for rapidly selecting the best corpus of relevant documents for analysis.
arXiv Detail & Related papers (2023-05-04T18:15:45Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches.
We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.