Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification
- URL: http://arxiv.org/abs/2007.00875v1
- Date: Thu, 2 Jul 2020 04:43:31 GMT
- Title: Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification
- Authors: Chetanya Rastogi, Nikka Mofid, Fang-I Hsiao
- Abstract summary: This paper tackles one of the greatest limitations in Machine Learning: Data Scarcity.
We explore whether high accuracy classifiers can be built from small datasets, utilizing a combination of data augmentation techniques and machine learning algorithms.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles one of the greatest limitations in Machine Learning: Data
Scarcity. Specifically, we explore whether high accuracy classifiers can be
built from small datasets, utilizing a combination of data augmentation
techniques and machine learning algorithms. In this paper, we experiment with
Easy Data Augmentation (EDA) and Backtranslation, as well as with three popular
learning algorithms, Logistic Regression, Support Vector Machine (SVM), and
Bidirectional Long Short-Term Memory Network (Bi-LSTM). For our
experimentation, we utilize the Wikipedia Toxic Comments dataset so that in the
process of exploring the benefits of data augmentation, we can develop a model
to detect and classify toxic speech in comments to help fight back against
cyberbullying and online harassment. Ultimately, we found that data
augmentation techniques can be used to significantly boost the performance of
classifiers and are an excellent strategy to combat lack of data in NLP
problems.
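As a rough illustration of the augmentation side of the paper, the sketch below implements two of EDA's four operations (random swap and random deletion; synonym replacement and random insertion additionally need a thesaurus such as WordNet, and backtranslation needs an external translation model, so both are omitted). The function names and parameters are illustrative, not the authors' code.

```python
import random

def random_deletion(words, p=0.1):
    # EDA random deletion: drop each word with probability p,
    # keeping at least one word so the sentence never vanishes.
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n_swaps=1):
    # EDA random swap: exchange two randomly chosen positions, n_swaps times.
    words = words[:]
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def eda_augment(sentence, n_aug=4):
    # Produce n_aug noisy variants of one labeled comment; each variant
    # keeps the original label, multiplying the effective training set.
    words = sentence.split()
    ops = [random_deletion, random_swap]
    return [" ".join(random.choice(ops)(words)) for _ in range(n_aug)]

print(eda_augment("you are such a toxic troll"))
```

Each augmented variant inherits the seed comment's toxicity label, which is how a small labeled set is stretched before training the Logistic Regression, SVM, or Bi-LSTM classifiers.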
Related papers
- A Study of Data Augmentation Techniques to Overcome Data Scarcity in Wound Classification using Deep Learning [0.0]
We show that data augmentation can improve classification performance (F1 score) by up to 11% on top of state-of-the-art models.
Our experiments with GAN-based augmentation prove the viability of using DE-GANs to generate wound images with richer variations.
arXiv Detail & Related papers (2024-11-04T00:24:50Z)
- Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models [25.91643745340183]
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora.
This poses a risk of privacy and copyright violations, highlighting the need for efficient machine unlearning methods.
We propose two novel techniques for robust and efficient unlearning for LLMs.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
- Enhancing Sentiment Analysis Results through Outlier Detection Optimization [0.5439020425819]
This study investigates the potential of identifying and addressing outliers in text data with subjective labels.
We utilize the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets.
arXiv Detail & Related papers (2023-11-25T18:20:43Z)
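Since the entry above names a concrete algorithm, here is a heavily simplified Deep SVDD-style sketch (after Ruff et al.): map inputs close to a fixed center and score outliers by distance. Bias-free layers and a precomputed center guard against the trivial collapsed solution; the dimensions and the 5% threshold are arbitrary assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# Toy one-class network over stand-in document embeddings (300-d).
phi = nn.Sequential(nn.Linear(300, 64, bias=False), nn.ReLU(),
                    nn.Linear(64, 32, bias=False))

x = torch.randn(512, 300)        # placeholder for sentence embeddings
with torch.no_grad():
    c = phi(x).mean(dim=0)       # fix the center from an initial forward pass

opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
for _ in range(100):
    loss = ((phi(x) - c) ** 2).sum(dim=1).mean()  # pull embeddings toward c
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    scores = ((phi(x) - c) ** 2).sum(dim=1)       # distance = outlier score
    flagged = scores > scores.quantile(0.95)      # flag the top 5%
print(int(flagged.sum()), "flagged outliers")
```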
- Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique.
We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)
- Self-omics: A Self-supervised Learning Framework for Multi-omics Cancer Data [4.843654097048771]
Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data.
We develop a novel pre-training paradigm that consists of various SSL components.
Our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer dataset.
arXiv Detail & Related papers (2022-10-03T11:20:12Z)
- Few-Shot Class-Incremental Learning via Entropy-Regularized Data-Free Replay [52.251188477192336]
Few-shot class-incremental learning (FSCIL) has been proposed aiming to enable a deep learning system to incrementally learn new classes with limited data.
We show through empirical results that adopting data replay is surprisingly favorable.
We propose using data-free replay that can synthesize data by a generator without accessing real data.
arXiv Detail & Related papers (2022-07-22T17:30:51Z)
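The data-free replay idea above can be sketched loosely: optimize a generator against a frozen old-class model so its outputs act as pseudo-samples that can be replayed without touching real data. This is one simplified reading with toy dimensions; the paper's actual entropy regularization and training recipe differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

old_model = nn.Sequential(nn.Linear(64, 10)).eval()  # frozen base-class classifier (toy)
for p in old_model.parameters():
    p.requires_grad_(False)

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))  # noise -> input
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

for _ in range(200):
    z = torch.randn(128, 16)
    probs = F.softmax(old_model(G(z)), dim=1)
    # Entropy of the old model's predictions on synthetic samples; minimizing
    # it pushes G toward inputs the old model recognizes confidently.
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    opt.zero_grad(); entropy.backward(); opt.step()

replay_x = G(torch.randn(256, 16)).detach()   # synthetic replay batch, no real data
replay_y = old_model(replay_x).argmax(dim=1)  # pseudo-labels from the frozen model
```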
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV).
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on image classification on all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
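To make the "learning by compression" idea above tangible, here is a toy 1-nearest-neighbour classifier under normalized compression distance, with gzip standing in for NPC-LV's learned latent-variable compressor. It illustrates the family of methods, not NPC-LV itself.

```python
import gzip

def clen(s: str) -> int:
    # Compressed length in bytes, a computable proxy for complexity.
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    # Normalized compression distance: similar strings compress well together.
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(text, labeled):
    # 1-NN over the few labeled examples, with no training step at all.
    return min(labeled, key=lambda ex: ncd(text, ex[0]))[1]

labeled = [("you are a wonderful person", "clean"),
           ("what an idiot you are", "toxic")]
print(classify("you absolute idiot", labeled))
```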
- Weakly Supervised Change Detection Using Guided Anisotropic Diffusion [97.43170678509478]
We propose original ideas that help us to leverage such datasets in the context of change detection.
First, we propose the guided anisotropic diffusion (GAD) algorithm, which improves semantic segmentation results.
We then show its potential in two weakly-supervised learning strategies tailored for change detection.
arXiv Detail & Related papers (2021-12-31T10:03:47Z)
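For intuition about the entry above, the sketch below is the classic (unguided) Perona-Malik anisotropic diffusion: smooth within regions while an edge-stopping function suppresses smoothing across strong gradients. GAD additionally steers the conduction with a guide image, which is omitted here; the parameters are illustrative.

```python
import numpy as np

def anisotropic_diffusion(img, n_iter=20, kappa=0.1, lam=0.2):
    # Classic Perona-Malik scheme; np.roll gives wrap-around borders,
    # which is acceptable for a sketch.
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)  # edge-stopping conduction
    for _ in range(n_iter):
        dn = np.roll(u, -1, axis=0) - u  # gradients toward the 4 neighbours
        ds = np.roll(u, 1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        u += lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u

smoothed = anisotropic_diffusion(np.random.rand(64, 64))
```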
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
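Assuming the ratio definition of Feature Density used in that line of work (unique features over total feature occurrences), the metric in the entry above reduces to a few lines; the unigram tokenizer here is a simplification of the paper's linguistically-backed preprocessing.

```python
def feature_density(docs):
    # Unique features divided by all feature occurrences; higher density
    # roughly indicates a harder, more varied dataset.
    feats = [tok for d in docs for tok in d.lower().split()]
    return len(set(feats)) / len(feats)

corpus = ["you are great", "you are a troll", "such a toxic troll"]
print(round(feature_density(corpus), 3))
```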
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
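One plausible way to turn a predicted segmentation map into a single classification label, as the CvS entry above describes, is a majority vote over non-background pixels. The reduction below is a hypothetical stand-in; the paper may aggregate differently.

```python
import numpy as np

def label_from_segmentation(seg_map, background=0):
    # Majority vote over foreground pixels of the predicted map.
    fg = seg_map[seg_map != background]
    if fg.size == 0:
        return background
    vals, counts = np.unique(fg, return_counts=True)
    return int(vals[counts.argmax()])

seg = np.zeros((8, 8), dtype=int)
seg[2:6, 2:6] = 3                     # toy prediction: a class-3 object
print(label_from_segmentation(seg))   # -> 3
```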
- A little goes a long way: Improving toxic language classification despite data scarcity [13.21611612938414]
Detection of some types of toxic language is hampered by extreme scarcity of labeled training data.
Data augmentation - generating new synthetic data from a labeled seed dataset - can help.
We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers.
arXiv Detail & Related papers (2020-09-25T17:04:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.