A little goes a long way: Improving toxic language classification despite data scarcity
- URL: http://arxiv.org/abs/2009.12344v2
- Date: Sat, 24 Oct 2020 19:31:34 GMT
- Title: A little goes a long way: Improving toxic language classification despite data scarcity
- Authors: Mika Juuti, Tommi Gröndahl, Adrian Flanagan and N. Asokan
- Abstract summary: Detection of some types of toxic language is hampered by extreme scarcity of labeled training data.
Data augmentation - generating new synthetic data from a labeled seed dataset - can help.
We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers.
- Score: 13.21611612938414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detection of some types of toxic language is hampered by extreme scarcity of
labeled training data. Data augmentation - generating new synthetic data from a
labeled seed dataset - can help. The efficacy of data augmentation on toxic
language classification has not been fully explored. We present the first
systematic study on how data augmentation techniques impact performance across
toxic language classifiers, ranging from shallow logistic regression
architectures to BERT - a state-of-the-art pre-trained Transformer network. We
compare the performance of eight techniques on very scarce seed datasets. We
show that while BERT performed the best, shallow classifiers performed
comparably when trained on data augmented with a combination of three
techniques, including GPT-2-generated sentences. We discuss the interplay of
performance and computational overhead, which can inform the choice of
techniques under different constraints.
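A minimal sketch of the kind of pipeline the abstract describes, assuming Python with `transformers` and `scikit-learn`: GPT-2 continues labeled seed sentences to produce synthetic training data, and a shallow TF-IDF + logistic-regression classifier is trained on the augmented set. The seed sentences, generation settings, and label-inheritance rule are illustrative placeholders, not the authors' exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import pipeline

# Tiny labeled seed set (placeholder sentences); 1 = toxic, 0 = non-toxic.
seeds = [("placeholder toxic seed sentence", 1),
         ("placeholder benign seed sentence", 0)]

# GPT-2 continues each seed sentence; each synthetic sentence inherits
# the seed's label (one simple label-assignment scheme among several).
generator = pipeline("text-generation", model="gpt2")
texts, labels = [], []
for text, label in seeds:
    texts.append(text)
    labels.append(label)
    for out in generator(text, max_new_tokens=30,
                         num_return_sequences=3, do_sample=True):
        texts.append(out["generated_text"])
        labels.append(label)

# Shallow classifier: logistic regression over TF-IDF n-grams of the
# augmented (seed + synthetic) training set.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(texts), labels)
```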
Related papers
- Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets [4.542616945567623]
This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL).
AGCL is designed to enhance classification performance on small medical datasets through synthetic data generation.
It was applied to Parkinson's disease screening, utilizing facial expression data.
arXiv Detail & Related papers (2024-09-26T09:51:08Z)
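A loose sketch of the AGCL idea as the abstract describes it (not the authors' implementation), assuming Python with NumPy and scikit-learn: cluster the latent codes of one class, then sample synthetic points around each cluster centroid. `Z`, the cluster count, and the noise scale are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def agcl_like_generate(Z, n_clusters=3, n_new_per_cluster=50,
                       noise_scale=0.5, seed=0):
    """Cluster latent codes of one class, then sample synthetic points
    from Gaussians centred on the cluster centroids."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(Z)
    synthetic = [centroid + noise_scale * rng.standard_normal(
                     (n_new_per_cluster, Z.shape[1]))
                 for centroid in km.cluster_centers_]
    return np.vstack(synthetic)

# Usage with stand-in latent features for one minority class:
Z = np.random.default_rng(1).standard_normal((40, 16))
synthetic_points = agcl_like_generate(Z)
```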
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Enhancing Sentiment Analysis Results through Outlier Detection Optimization [0.5439020425819]
This study investigates the potential of identifying and addressing outliers in text data with subjective labels.
We utilize the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets.
arXiv Detail & Related papers (2023-11-25T18:20:43Z)
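A hedged sketch of the Deep SVDD objective the study applies, in PyTorch: an encoder is trained to pull embeddings toward a fixed center, and points that remain far from the center are flagged as outliers. The encoder architecture, dimensions, and 95th-percentile cutoff are placeholders, not the study's setup.

```python
import torch
import torch.nn as nn

# Placeholder encoder and stand-in 300-d text embeddings.
encoder = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 32))
X = torch.randn(256, 300)

with torch.no_grad():
    center = encoder(X).mean(dim=0)            # initialise the center c

# One-Class Deep SVDD objective: minimise mean squared distance to c.
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    dist = ((encoder(X) - center) ** 2).sum(dim=1)
    dist.mean().backward()
    opt.step()

# Score by distance to the center; the farthest points are outliers.
with torch.no_grad():
    scores = ((encoder(X) - center) ** 2).sum(dim=1)
outliers = scores > scores.quantile(0.95)      # illustrative cutoff
```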
- Text generation for dataset augmentation in security classification tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill the gap left by scarce labeled data in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z)
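One plausible shape of that augmentation strategy, sketched in Python: build a few-shot prompt from the scarce known positives and ask a text generator for more. `complete` is a hypothetical callable standing in for whatever generation backend is available (the paper used GPT-3); the prompt wording is illustrative.

```python
def build_fewshot_prompt(known_positives, n_examples=3):
    """Assemble a few-shot prompt from the scarce positive-class samples.
    The framing text is illustrative, not taken from the paper."""
    shots = "\n".join(f"Example: {p}" for p in known_positives[:n_examples])
    return ("The following are examples of security-relevant reports.\n"
            f"{shots}\nExample:")

def augment_positives(known_positives, complete, n_new=10):
    """`complete` is a hypothetical text-generation callable
    (prompt -> str); the paper used GPT-3 as the backend."""
    prompt = build_fewshot_prompt(known_positives)
    return [complete(prompt) for _ in range(n_new)]
```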
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study presents the largest set of experiments to date validating, quantifying, and exposing undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime [35.95241861664597]
This paper proposes an Adversarial Word Dilution (AWD) method that can generate hard positive examples as text data augmentations.
Our idea of augmenting the text data is to dilute the embedding of strong positive words by weighted mixing with unknown-word embedding.
Empirical studies on three benchmark datasets show that AWD can generate more effective data augmentations and outperform the state-of-the-art text data augmentation methods.
arXiv Detail & Related papers (2023-05-16T08:46:11Z)
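A sketch of the dilution operation as the abstract describes it, in PyTorch: each strong positive word's embedding is mixed with the unknown-word embedding, e' = (1 - alpha) * e_word + alpha * e_unk. In AWD the per-word weights are learned adversarially (constrained so the example remains a hard positive), which is omitted here; shapes and values are illustrative.

```python
import torch

def dilute(word_embs, unk_emb, alpha):
    """Weighted mixing of each word embedding with the unknown-word
    embedding: e' = (1 - alpha) * e_word + alpha * e_unk."""
    alpha = alpha.clamp(0.0, 1.0).unsqueeze(-1)   # (seq_len, 1)
    return (1 - alpha) * word_embs + alpha * unk_emb

seq = torch.randn(12, 128)   # embeddings of one training sentence
unk = torch.randn(128)       # embedding of the unknown token
alpha = torch.rand(12)       # dilution weights; learned adversarially in AWD
hard_positive = dilute(seq, unk, alpha)
```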
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
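A minimal sketch of learning from soft teacher labels, in PyTorch: the student's loss on augmented examples blends a distillation term against the teacher's soft predictions with the usual cross-entropy on hard labels. The 0.5 blending weight is an illustrative choice, not the paper's, and the paper's on-the-fly noise weighting is omitted.

```python
import torch
import torch.nn.functional as F

def denoised_loss(student_logits, teacher_logits, hard_labels, weight=0.5):
    """Blend a distillation term against the teacher's soft labels with
    cross-entropy on the hard labels of augmented examples."""
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    hard = F.cross_entropy(student_logits, hard_labels)
    return weight * soft + (1.0 - weight) * hard

# Usage: logits over 2 classes for a batch of 4 augmented examples.
student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)
loss = denoised_loss(student_logits, teacher_logits,
                     torch.tensor([0, 1, 1, 0]))
```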
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each piece of training data.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
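A sketch of that two-step interpolation in hidden space, loosely following the DoubleMix description, in PyTorch: first mix the perturbed variants of a sample, then interpolate the mixture with the original hidden state. The Dirichlet/Beta sampling and hyperparameters are illustrative, not the paper's exact settings.

```python
import torch

def doublemix_like(h_orig, h_perturbed, alpha=2.0, beta=2.0):
    """Two-step interpolation in hidden space: (1) mix the k perturbed
    variants of a sample with Dirichlet weights, (2) interpolate the
    mixture with the original hidden state using a Beta-sampled weight.
    h_orig: (dim,); h_perturbed: (k, dim)."""
    k = h_perturbed.size(0)
    w = torch.distributions.Dirichlet(torch.full((k,), alpha)).sample()
    mixed = (w.unsqueeze(-1) * h_perturbed).sum(dim=0)
    lam = torch.distributions.Beta(beta, beta).sample()
    return lam * h_orig + (1.0 - lam) * mixed

h = torch.randn(768)             # original hidden state
h_pert = torch.randn(3, 768)     # three perturbed variants
h_aug = doublemix_like(h, h_pert)
```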
- Evaluating BERT-based Pre-training Language Models for Detecting Misinformation [2.1915057426589746]
It is challenging to control the quality of online information due to the lack of supervision over all the information posted online.
There is a need for automated rumour detection techniques to limit the adverse effects of spreading misinformation.
This study proposes using BERT-based pre-trained language models to encode text data into vectors, and neural network models to classify these vectors in order to detect misinformation.
arXiv Detail & Related papers (2022-03-15T08:54:36Z)
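A minimal sketch of the described pipeline, assuming Python with `transformers` and scikit-learn: encode texts into [CLS] vectors with a BERT model and classify the vectors. The study trains neural classifiers on the vectors; a logistic regression stands in here for brevity, and the texts and labels are placeholders.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

texts = ["placeholder claim one", "placeholder claim two"]
labels = [0, 1]   # 0 = reliable, 1 = misinformation (illustrative)

# Encode each text as its [CLS] vector.
with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True,
                    return_tensors="pt")
    vectors = bert(**enc).last_hidden_state[:, 0]

# Classify the vectors (a stand-in for the study's neural classifiers).
clf = LogisticRegression(max_iter=1000).fit(vectors.numpy(), labels)
```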
- Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification [0.0]
This paper tackles one of the greatest limitations in Machine Learning: Data Scarcity.
We explore whether high accuracy classifiers can be built from small datasets, utilizing a combination of data augmentation techniques and machine learning algorithms.
arXiv Detail & Related papers (2020-07-02T04:43:31Z)
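The abstract does not enumerate the augmentation techniques combined; the sketch below shows one simple, EDA-style text augmentation (random word swap and deletion) of the kind such studies typically include. It is an assumption, not the paper's specific method.

```python
import random

def eda_like_augment(sentence, p_delete=0.1, n_swaps=1, seed=None):
    """Randomly swap word pairs, then randomly delete words.
    An EDA-style augmentation, assumed for illustration only."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        if len(words) > 1:
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    kept = [w for w in words if rng.random() > p_delete]
    return " ".join(kept) if kept else " ".join(words)

augmented = eda_like_augment("a small labeled example sentence", seed=42)
```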
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle the added cost of training on the enlarged dataset, we propose applying a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
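A hedged sketch of the "exploit reliable samples from unlabeled data" step, in PyTorch: keep unlabeled examples whose top predicted class probability clears a confidence threshold and reuse the prediction as a pseudo-label. The threshold and selection rule are illustrative; the paper's dataset distillation step is not shown.

```python
import torch

def select_reliable(model, unlabeled_batch, threshold=0.95):
    """Keep unlabeled examples whose top predicted probability clears the
    threshold; reuse the prediction as a pseudo-label."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_batch), dim=-1)
    confidence, pseudo_labels = probs.max(dim=-1)
    mask = confidence >= threshold
    return unlabeled_batch[mask], pseudo_labels[mask]
```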
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.