Enhanced Offensive Language Detection Through Data Augmentation
- URL: http://arxiv.org/abs/2012.02954v1
- Date: Sat, 5 Dec 2020 05:45:16 GMT
- Title: Enhanced Offensive Language Detection Through Data Augmentation
- Authors: Ruibo Liu, Guangxuan Xu, Soroush Vosoughi
- Abstract summary: The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets.
The dataset suffers from class imbalance, where certain labels are extremely rare compared with other classes.
We present Dager, a generation-based data augmentation method that improves classification performance on imbalanced and low-resource data.
- Score: 2.2022484178680872
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Detecting offensive language on social media is an important task. The
ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content
using a crowd-sourced dataset containing 100k labelled tweets. The dataset,
however, suffers from class imbalance, where certain labels are extremely rare
compared with other classes (e.g., the hateful class is only 5% of the data).
In this work, we present Dager (Data Augmenter), a generation-based data
augmentation method that improves classification performance on imbalanced and
low-resource data such as the offensive language dataset. Dager
extracts the lexical features of a given class, and uses these features to
guide the generation of a conditional generator built on GPT-2. The generated
text can then be added to the training set as augmentation data. We show that
applying Dager can increase the F1 score of the data challenge by 11% when we
use 1% of the whole dataset for training (using BERT for classification);
moreover, the generated data also preserves the original labels very well. We
test Dager on four different classifiers (BERT, CNN, Bi-LSTM with attention,
and Transformer), observing consistent improvements in detection across all
four, indicating that our method is effective and classifier-agnostic.
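To make the pipeline concrete, below is a minimal Python sketch of the idea. It is not the authors' code: the abstract does not specify the feature extractor or the conditioning mechanism, so this sketch assumes TF-IDF-ranked keywords stand in for the "lexical features" and a keyword-seeded prompt stands in for the conditional generator built on GPT-2.

```python
# Illustrative sketch of Dager-style augmentation (not the authors' code).
# Assumptions: TF-IDF keywords approximate the paper's "lexical features",
# and a keyword-seeded prompt approximates its conditioning scheme.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def top_lexical_features(texts, k=10):
    """Rank tokens that characterise a class by mean TF-IDF weight."""
    vec = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vec.fit_transform(texts)
    scores = tfidf.mean(axis=0).A1
    vocab = vec.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda p: -p[1])
    return [w for w, _ in ranked[:k]]

def generate_augmentations(class_texts, n_samples=100, max_len=40):
    """Generate synthetic minority-class examples with a GPT-2 LM."""
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    prompt = " ".join(top_lexical_features(class_texts))  # crude conditioning
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, do_sample=True, top_p=0.9, max_length=max_len,
                      num_return_sequences=n_samples,
                      pad_token_id=tok.eos_token_id)
    return [tok.decode(o, skip_special_tokens=True) for o in out]
```

The generated texts would then be appended to the training set under the minority-class label, as the abstract describes.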
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns importance weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z) - PromptMix: A Class Boundary Augmentation Method for Large Language Model
- PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation [19.351192775314612]
We propose a method to generate more helpful augmented data by utilizing the LLM's abilities to follow instructions and perform few-shot classifications.
Our specific PromptMix method consists of two steps: 1) generate challenging text augmentations near class boundaries; however, since borderline examples increase the risk of false positives in the dataset, we 2) relabel the generated texts using an LLM-based classifier to improve the correctness of their labels.
We evaluate the proposed method in challenging 2-shot and zero-shot settings on four text classification datasets: Banking77, TREC6, Subjectivity (SUBJ) and Twitter Complaints.
arXiv Detail & Related papers (2023-10-22T05:43:23Z) - Improving Classifier Robustness through Active Generation of Pairwise
- Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals [22.916599410472102]
We present a novel framework that utilizes counterfactual generative models to generate a large number of diverse counterfactuals.
We show that with a small amount of human-annotated counterfactual data (10%), we can generate a counterfactual augmentation dataset with learned labels.
arXiv Detail & Related papers (2023-05-22T23:19:01Z) - AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT, named AugGPT.
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z) - A new data augmentation method for intent classification enhancement and
- A new data augmentation method for intent classification enhancement and its application on spoken conversation datasets [23.495743195811375]
We present the Nearest Neighbors Scores Improvement (NNSI) algorithm for automatic data selection and labeling.
The NNSI algorithm reduces the need for manual labeling by automatically selecting highly ambiguous samples and labeling them with high accuracy.
We demonstrate the use of NNSI on two large-scale, real-life voice conversation systems.
arXiv Detail & Related papers (2022-02-21T11:36:19Z) - Unsupervised Selective Labeling for More Effective Semi-Supervised
- Unsupervised Selective Labeling for More Effective Semi-Supervised Learning [46.414510522978425]
Unsupervised selective labeling consistently improves SSL methods over state-of-the-art active learning given labeled data.
Our work sets a new standard for practical and efficient SSL.
arXiv Detail & Related papers (2021-10-06T18:25:50Z) - Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo, in which handwritten mathematical terms are to be classified automatically.
The input data set contains data from different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z) - Robustness to Spurious Correlations in Text Classification via
Automatically Generated Counterfactuals [8.827892752465958]
We propose to train a robust text classifier by augmenting the training data with automatically generated counterfactual data.
We show that the robust classifier makes meaningful and trustworthy predictions by emphasizing causal features and de-emphasizing non-causal features.
arXiv Detail & Related papers (2020-12-18T03:57:32Z) - FIND: Human-in-the-Loop Debugging Deep Text Classifiers [55.135620983922564]
We propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features.
Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets.
arXiv Detail & Related papers (2020-10-10T12:52:53Z) - Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled
- Classify and Generate Reciprocally: Simultaneous Positive-Unlabelled Learning and Conditional Generation with Extra Data [77.31213472792088]
The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems.
We address this problem by leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data.
We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data.
arXiv Detail & Related papers (2020-06-14T08:27:40Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
Because the resulting dataset is large, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.