Automatic Textual Normalization for Hate Speech Detection
- URL: http://arxiv.org/abs/2311.06851v4
- Date: Thu, 25 Jul 2024 06:41:43 GMT
- Title: Automatic Textual Normalization for Hate Speech Detection
- Authors: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen
- Abstract summary: Social media data contains a wide range of non-standard words (NSW).
Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization.
Our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model.
- Score: 0.8990550886501417
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.
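A minimal sketch of how such a Seq2Seq normalizer could be wired up with Hugging Face Transformers is shown below. The BARTpho checkpoint, generation settings, and example input are illustrative assumptions, not the authors' published configuration.

```python
# Illustrative sketch only: the abstract does not name the Seq2Seq backbone, so the
# checkpoint "vinai/bartpho-syllable" below is an assumption used for demonstration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "vinai/bartpho-syllable"  # assumed Vietnamese Seq2Seq backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def normalize(comment: str, max_length: int = 128) -> str:
    """Map a noisy social-media comment to its normalized form."""
    inputs = tokenizer(comment, return_tensors="pt",
                       truncation=True, max_length=max_length)
    output_ids = model.generate(**inputs, max_length=max_length,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# After fine-tuning on (noisy comment, normalized comment) pairs, the normalized
# text would be passed to a downstream Hate Speech Detection classifier.
print(normalize("ko bt lm s lun"))  # hypothetical non-standard-word input
```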
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media [1.053698976085779]
This study introduces an innovative automatic labeling framework to address the challenges of lexical normalization in social media texts.
We propose a framework that integrates semi-supervised learning with weak supervision techniques.
Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms.
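A loose illustration of the weak-labeling idea follows; the dictionary, the character-repetition rule, and the abstention logic are invented for demonstration and are not the framework described in the paper.

```python
# Minimal sketch of weak supervision for lexical normalization: simple labeling
# functions propose standard forms, and their suggestions become automatic labels.
import re

# Hypothetical abbreviation dictionary; a real system would mine this from data.
ABBREV = {"ko": "không", "bt": "biết", "dc": "được"}

def lf_dictionary(token: str) -> str | None:
    return ABBREV.get(token.lower())

def lf_char_repeat(token: str) -> str | None:
    # "luônnnn" -> "luôn": collapse characters repeated three or more times.
    collapsed = re.sub(r"(.)\1{2,}", r"\1", token)
    return collapsed if collapsed != token else None

def weak_label(token: str) -> str:
    for lf in (lf_dictionary, lf_char_repeat):
        suggestion = lf(token)
        if suggestion is not None:
            return suggestion
    return token  # abstain: keep the token unchanged

print([weak_label(t) for t in "ko bt nói j luônnnn".split()])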
arXiv Detail & Related papers (2024-09-30T16:26:40Z) - Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing? [3.8073142980733]
We ask whether the algorithms used in Neural Machine Translation have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts.
We conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts.
Our ultimate goal is to develop alternatives that do not enforce uniformity in the distribution of statistical properties in the output.
arXiv Detail & Related papers (2024-09-15T01:06:07Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have seen widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z) - Reference Free Domain Adaptation for Translation of Noisy Questions with Question Specific Rewards [22.297433705607464]
Translating questions using Neural Machine Translation poses more challenges in noisy environments.
We propose a training methodology that fine-tunes the NMT system only using source-side data.
Our approach balances adequacy and fluency by utilizing a loss function that combines BERTScore and Masked Language Model (MLM) Score.
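A rough sketch of how such a combined score could be computed appears below; the multilingual backbone, the cross-lingual use of BERTScore, and the weighting are assumptions rather than the paper's exact reward.

```python
# Rough sketch of a reward mixing adequacy (BERTScore against the source) and
# fluency (masked-LM pseudo-log-likelihood). Models and weighting are assumptions.
import torch
from bert_score import score as bert_score
from transformers import AutoTokenizer, AutoModelForMaskedLM

MLM_NAME = "bert-base-multilingual-cased"  # assumed multilingual backbone
mlm_tok = AutoTokenizer.from_pretrained(MLM_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MLM_NAME).eval()

def mlm_fluency(sentence: str) -> float:
    """Average pseudo-log-likelihood: mask each token in turn and score it."""
    ids = mlm_tok(sentence, return_tensors="pt")["input_ids"][0]
    total, count = 0.0, 0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = mlm_tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        count += 1
    return total / max(count, 1)

def combined_reward(translation: str, source: str, alpha: float = 0.5) -> float:
    # BERTScore F1 between the candidate translation and the source question.
    _, _, f1 = bert_score([translation], [source], model_type=MLM_NAME)
    return alpha * f1.item() + (1 - alpha) * mlm_fluency(translation)
```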
arXiv Detail & Related papers (2023-10-23T18:08:01Z) - Unify word-level and span-level tasks: NJUNLP's Participation for the
WMT2023 Quality Estimation Shared Task [59.46906545506715]
We introduce the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z) - Does Correction Remain A Problem For Large Language Models? [63.24433996856764]
This paper investigates the role of correction in the context of large language models by conducting two experiments.
The first experiment focuses on correction as a standalone task, employing few-shot learning techniques with GPT-like models for error correction.
The second experiment explores the notion of correction as a preparatory task for other NLP tasks, examining whether large language models can tolerate and perform adequately on texts containing certain levels of noise or errors.
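The few-shot setup of the first experiment could look roughly like the sketch below; the prompt, the demonstration pairs, and the GPT-2 stand-in model are assumptions, not the paper's actual configuration.

```python
# Illustrative few-shot correction prompt in the spirit of the first experiment;
# the exact prompts, demonstrations, and GPT-like model are not specified here,
# so everything below is a stand-in.
from transformers import pipeline

# Any instruction-following causal LM could be substituted; this choice is arbitrary.
generator = pipeline("text-generation", model="gpt2")

FEW_SHOT = (
    "Correct the errors in each sentence.\n"
    "Input: I has two cat.\nOutput: I have two cats.\n"
    "Input: She go to school yesterday.\nOutput: She went to school yesterday.\n"
)

def correct(sentence: str) -> str:
    prompt = FEW_SHOT + f"Input: {sentence}\nOutput:"
    completion = generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
    # Keep only the text generated after the final "Output:" marker.
    return completion[len(prompt):].strip().split("\n")[0]

print(correct("Their going to the park tomorow."))
```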
arXiv Detail & Related papers (2023-08-03T14:09:31Z) - AdaPrompt: Adaptive Model Training for Prompt-based NLP [77.12071707955889]
We propose AdaPrompt, adaptively retrieving external data for continual pretraining of PLMs.
Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings.
In zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35% relative error reduction.
arXiv Detail & Related papers (2022-02-10T04:04:57Z) - Sequence-to-Sequence Lexical Normalization with Multilingual Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem.
arXiv Detail & Related papers (2021-10-06T15:53:20Z) - Aggressive Language Detection with Joint Text Normalization via Adversarial Multi-task Learning [31.02484600391725]
Aggressive language detection (ALD) is one of the crucial applications in the NLP community.
In this work, we target improving the ALD by jointly performing text normalization (TN), via an adversarial multi-task learning framework.
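The adversarial multi-task pattern can be sketched as a shared encoder with task-specific heads plus a task discriminator behind a gradient-reversal layer; the architecture below is a generic placeholder, not the model from the paper.

```python
# Sketch of adversarial multi-task learning for ALD + TN: shared encoder, an ALD
# head, a TN head, and a gradient-reversal discriminator. Dimensions, the GRU
# encoder, and mean pooling are placeholder choices.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse gradients flowing into the shared encoder.
        return -ctx.lambd * grad_output, None

class AdversarialMultiTask(nn.Module):
    def __init__(self, vocab_size=10000, hidden=256, num_ald_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.ald_head = nn.Linear(hidden, num_ald_classes)  # aggressive-language classifier
        self.tn_head = nn.Linear(hidden, vocab_size)         # per-token normalization outputs
        self.task_discriminator = nn.Linear(hidden, 2)       # which task a feature came from

    def forward(self, token_ids, lambd=1.0):
        states, _ = self.encoder(self.embed(token_ids))
        pooled = states.mean(dim=1)
        ald_logits = self.ald_head(pooled)
        tn_logits = self.tn_head(states)
        adv_logits = self.task_discriminator(GradReverse.apply(pooled, lambd))
        return ald_logits, tn_logits, adv_logits
```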
arXiv Detail & Related papers (2020-09-19T06:26:07Z)