Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection
- URL: http://arxiv.org/abs/2106.04681v1
- Date: Tue, 25 May 2021 09:52:42 GMT
- Title: Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection
- Authors: Djamila Romaissa Beddiar and Md Saroar Jahan and Mourad Oussalah
- Abstract summary: We present a new deep learning-based method that fuses a Back Translation method and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets, namely the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, Olid, and the Wikipedia toxic comments dataset.
- Score: 1.192436948211501
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the proliferation of user-generated content on social media
platforms, establishing mechanisms to automatically identify toxic and abusive
content has become a prime concern for regulators, researchers, and society.
Keeping the balance between freedom of speech and respect for each other's
dignity is a major concern of social media platform regulators. Although
automatic detection of offensive content using deep learning approaches seems
to provide encouraging results, training deep learning-based models requires
large amounts of high-quality labeled data, which is often missing. In this
regard, we present in this paper a new deep learning-based method that fuses a
Back Translation method and a Paraphrasing technique for data augmentation. Our
pipeline investigates different word-embedding-based architectures for
classification of hate speech. The back translation technique relies on an
encoder-decoder architecture pre-trained on a large corpus and mostly used for
machine translation. In addition, paraphrasing exploits the transformer model
and the mixture of experts to generate diverse paraphrases. Finally, LSTM and
CNN are compared to seek enhanced classification results. We evaluate our
proposal on five publicly available datasets, namely the AskFm corpus, the
Formspring dataset, the Warner and Waseem dataset, Olid, and the Wikipedia
toxic comments dataset. The performance of the proposal, together with a
comparison to some related state-of-the-art results, demonstrates the
effectiveness and soundness of our proposal.
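As a rough illustration of the augmentation step described in the abstract, the sketch below expands a labeled dataset by passing each text through one or more rewriting functions and keeping the original label. It is a minimal sketch under stated assumptions, not the authors' implementation: the names make_back_translator, augment, and word_mapper are hypothetical, and the two toy dictionaries stand in for the pre-trained encoder-decoder translation model that the abstract mentions. A paraphrasing model could be plugged in as another rewriter in the same way.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Rewriter = Callable[[str], str]   # any text -> text function
Example = Tuple[str, int]         # (text, label)


def make_back_translator(to_pivot: Rewriter, from_pivot: Rewriter) -> Rewriter:
    """Compose two translators into a source -> pivot -> source rewriter."""
    def rewrite(text: str) -> str:
        return from_pivot(to_pivot(text))
    return rewrite


def augment(dataset: Sequence[Example], rewriters: Iterable[Rewriter]) -> List[Example]:
    """Return the dataset plus every new rewritten text, labels unchanged."""
    augmented = list(dataset)
    seen = {text for text, _ in dataset}
    for rewrite in rewriters:
        for text, label in dataset:
            candidate = rewrite(text)
            # Skip empty outputs and texts that are already present.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented


def word_mapper(table: Dict[str, str]) -> Rewriter:
    """Toy word-by-word 'translator'; a placeholder for a real NMT model."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Hypothetical toy dictionaries (assumption: English with a French pivot).
EN_TO_FR = {"you": "tu", "are": "es", "stupid": "bête", "nice": "gentil"}
FR_TO_EN = {"tu": "you", "es": "are", "bête": "dumb", "gentil": "kind"}

back_translate = make_back_translator(word_mapper(EN_TO_FR), word_mapper(FR_TO_EN))
data = [("you are stupid", 1), ("you are nice", 0)]
augmented = augment(data, [back_translate])

for text, label in augmented:
    print(label, text)
```

Treating back translation and paraphrasing as interchangeable text-to-text functions keeps the expansion loop independent of the underlying models, and the duplicate check avoids inflating the dataset when a round trip returns the input unchanged.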
Related papers
- Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language [6.200058263544999]
This study focuses on detecting bilingual hate speech in YouTube comments.
We include factors such as content similarity, definition similarity, and common hate words to measure the impact of datasets on performance.
The best performance was obtained by combining datasets from YouTube comments, Twitter, and Gab, with F1-scores of 0.74 and 0.68 for English and German YouTube comments, respectively.
arXiv Detail & Related papers (2024-10-02T10:22:53Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose a SummAttacker, which is an efficient approach to generating adversarial samples based on language models.
arXiv Detail & Related papers (2023-06-01T19:04:17Z)
- Hate Speech and Offensive Language Detection using an Emotion-aware Shared Encoder [1.8734449181723825]
Existing works on hate speech and offensive language detection produce promising results based on pre-trained transformer models.
This paper addresses a multi-task joint learning approach that combines external emotional features extracted from other corpora.
Our findings demonstrate that emotional knowledge helps to more reliably identify hate speech and offensive language across datasets.
arXiv Detail & Related papers (2023-02-17T09:31:06Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among urban populations all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of unstructured code-mixed Hinglish language.
arXiv Detail & Related papers (2021-05-11T10:02:28Z)
- Named Entity Recognition for Social Media Texts with Semantic Augmentation [70.44281443975554]
Existing approaches for named entity recognition suffer from data sparsity problems when conducted on short and informal texts.
We propose a neural-based approach to NER for social media texts where both local (from running text) and augmented semantics are taken into account.
arXiv Detail & Related papers (2020-10-29T10:06:46Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Automatically Ranked Russian Paraphrase Corpus for Text Generation [0.0]
The article is focused on automatic development and ranking of a large corpus for Russian paraphrase generation.
Existing manually annotated paraphrase datasets for Russian are limited to the small ParaPhraser corpus and ParaPlag.
arXiv Detail & Related papers (2020-06-17T08:40:52Z)
- WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types.
This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches.
We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.