Related papers: Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

URL: http://arxiv.org/abs/2307.16609v1
Date: Mon, 31 Jul 2023 12:35:54 GMT
Title: Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks
Authors: Jo\~ao A. Leite, Carolina Scarton, Diego F. Silva
Abstract summary: "Noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against adversarial attacks. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
Score: 3.703767478524629
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT architectures varying in size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.

Related papers

Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets [0.0]
Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance. We propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning.
arXiv Detail & Related papers (2025-03-28T15:16:48Z)
Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model. We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
arXiv Detail & Related papers (2024-06-07T12:39:54Z)
Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information, they are proven useful for few-shot learning of language model. In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
Noisy student-teacher training for robust keyword spotting [13.264760485020757]
We propose self-training with noisy student-teacher approach for streaming keyword spotting. The proposed method applies aggressive data augmentation on the input of both student and teacher. Experiments show that the proposed self-training with noisy student-teacher training improves accuracy of some difficult-conditioned test sets by as much as 60%.
arXiv Detail & Related papers (2021-06-03T05:36:18Z)
Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning. We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data. Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.