Noisy Self-Training with Data Augmentations for Offensive and Hate
Speech Detection Tasks
- URL: http://arxiv.org/abs/2307.16609v1
- Date: Mon, 31 Jul 2023 12:35:54 GMT
- Title: Noisy Self-Training with Data Augmentations for Offensive and Hate
Speech Detection Tasks
- Authors: João A. Leite, Carolina Scarton, Diego F. Silva
- Abstract summary: "Noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against adversarial attacks.
We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
- Score: 3.703767478524629
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Online social media is rife with offensive and hateful comments, prompting
the need for their automatic detection given the sheer amount of posts created
every second. Creating high-quality human-labelled datasets for this task is
difficult and costly, especially because non-offensive posts are significantly
more frequent than offensive ones. However, unlabelled data is abundant and is
easier and cheaper to obtain. In this scenario, self-training methods, using
weakly-labelled examples to increase the amount of training data, can be
employed. Recent "noisy" self-training approaches incorporate data augmentation
techniques to ensure prediction consistency and increase robustness against
noisy data and adversarial attacks. In this paper, we experiment with default
and noisy self-training using three different textual data augmentation
techniques across five different pre-trained BERT architectures varying in
size. We evaluate our experiments on two offensive/hate-speech datasets and
demonstrate that (i) self-training consistently improves performance regardless
of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii)
noisy self-training with textual data augmentations, despite being successfully
applied in similar settings, decreases performance on offensive and hate-speech
domains when compared to the default method, even with state-of-the-art
augmentations such as backtranslation.
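The abstract's two variants can be summarised as a loop: a teacher model trained on labelled data pseudo-labels unlabelled examples above a confidence threshold, and the student is retrained on the enlarged set; the "noisy" variant additionally perturbs the pseudo-labelled inputs with a textual augmentation. The sketch below is a toy illustration of that loop only: the keyword-overlap classifier, the word-shuffle "augmentation", and the 0.7 threshold are hypothetical stand-ins, not the paper's BERT models or backtranslation.

```python
# Toy sketch of default vs. "noisy" self-training (not the paper's setup).
import random


OFFENSIVE_LABEL, CLEAN_LABEL = 1, 0


def train(examples):
    """Toy 'model': remember which words appeared under each label."""
    vocab = {CLEAN_LABEL: set(), OFFENSIVE_LABEL: set()}
    for text, label in examples:
        vocab[label].update(text.split())

    def predict(text):
        words = set(text.split())
        score1 = len(words & vocab[OFFENSIVE_LABEL])
        score0 = len(words & vocab[CLEAN_LABEL])
        label = OFFENSIVE_LABEL if score1 > score0 else CLEAN_LABEL
        total = score0 + score1
        confidence = max(score0, score1) / total if total else 0.5
        return label, confidence

    return predict


def augment(text):
    """Stand-in for a textual augmentation (e.g. backtranslation):
    here we merely shuffle the word order."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)


def self_train(labeled, unlabeled, threshold=0.7, noisy=False):
    """One round of self-training; noisy=True perturbs pseudo-labelled inputs."""
    teacher = train(labeled)
    pseudo = []
    for text in unlabeled:
        label, confidence = teacher(text)
        if confidence >= threshold:  # keep only confident pseudo-labels
            inp = augment(text) if noisy else text
            pseudo.append((inp, label))
    # Student is retrained on labelled + pseudo-labelled data.
    return train(labeled + pseudo)
```

A single round would then look like `student = self_train(labeled, unlabeled, noisy=True)`; the paper's finding is that in the offensive/hate-speech domain the `noisy=False` (default) variant performs better.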
Related papers
- Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model.
We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
arXiv Detail & Related papers (2024-06-07T12:39:54Z) - Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - On-the-fly Denoising for Data Augmentation in Natural Language
Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and are proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Noisy student-teacher training for robust keyword spotting [13.264760485020757]
We propose self-training with noisy student-teacher approach for streaming keyword spotting.
The proposed method applies aggressive data augmentation on the input of both student and teacher.
Experiments show that the proposed self-training with noisy student-teacher training improves accuracy of some difficult-conditioned test sets by as much as 60%.
arXiv Detail & Related papers (2021-06-03T05:36:18Z) - Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.