Ceasing hate withMoH: Hate Speech Detection in Hindi-English
Code-Switched Language
- URL: http://arxiv.org/abs/2110.09393v1
- Date: Mon, 18 Oct 2021 15:24:32 GMT
- Title: Ceasing hate withMoH: Hate Speech Detection in Hindi-English
Code-Switched Language
- Authors: Arushi Sharma, Anubha Kabra, Minni Jain
- Abstract summary: This work focuses on analyzing hate speech in Hindi-English code-switched language.
To preserve the structure of the data, we developed MoH, or Map Only Hindi, a name that means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
- Score: 2.9926023796813728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media has become a bedrock for people to voice their opinions
worldwide. Due to the greater sense of freedom with the anonymity feature, it
is possible to disregard social etiquette online and attack others without
facing severe consequences, inevitably propagating hate speech. Current
measures to filter online content and curb the spread of hatred do not go far
enough. One factor contributing to this is the prevalence of regional languages
in social media and the paucity of language-flexible hate speech detectors. The
proposed work focuses on analyzing hate speech in Hindi-English code-switched
language. Our method explores transformation techniques to capture precise text
representation. To preserve the structure of the data and yet use it with existing
algorithms, we developed MoH, or Map Only Hindi, a name that means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi
transliteration using a knowledge base of Roman Hindi words. Finally, it
employs fine-tuned Multilingual BERT and MuRIL language models. We
conducted several quantitative experiments on three datasets and
evaluated performance using Precision, Recall, and F1 metrics. The first
experiment studies the performance of MoH-mapped text with classical machine
learning models and shows an average increase of 13% in F1 scores. The second
compares the proposed work's scores with those of the baseline models and
shows a 6% rise in performance. Finally, the third compares the proposed MoH
technique with various data simulations using the existing transliteration
library. Here, MoH outperforms the rest by 15%. Our results demonstrate a
significant improvement in the state-of-the-art scores on all three datasets.
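The mapping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `ROMAN_HINDI_KB` dictionary, the function names, and the use of a plain dictionary lookup as the language-identification step are all assumptions made for illustration. The idea shown is only that tokens recognised as Roman-script Hindi are replaced by their Devanagari form, while all other tokens are left unchanged.

```python
# Hypothetical toy knowledge base: Roman-script Hindi word -> Devanagari form.
# The knowledge base used in the paper is not reproduced here.
ROMAN_HINDI_KB = {
    "tum": "तुम",
    "bahut": "बहुत",
    "pyaar": "प्यार",
    "nahi": "नहीं",
    "ho": "हो",
}


def identify_language(token: str) -> str:
    """Label a token 'hi' if it appears in the knowledge base, else 'en'.

    Assumption: a dictionary lookup stands in for a real language identifier.
    """
    return "hi" if token.lower() in ROMAN_HINDI_KB else "en"


def map_only_hindi(text: str) -> str:
    """Transliterate only the tokens labelled Hindi; keep the rest as-is."""
    mapped = []
    for token in text.split():  # whitespace tokenisation, punctuation not handled
        if identify_language(token) == "hi":
            mapped.append(ROMAN_HINDI_KB[token.lower()])
        else:
            mapped.append(token)
    return " ".join(mapped)


if __name__ == "__main__":
    print(map_only_hindi("Tum bahut nice ho"))
```

In a pipeline of the kind the abstract describes, the mapped text would then be passed to a classifier such as a fine-tuned Multilingual BERT or MuRIL model. That step is omitted here.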
Related papers
- Adversarial Training For Low-Resource Disfluency Correction [50.51901599433536]
We propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC).
We show the benefit of our proposed technique, which crucially depends on synthetically generated disfluent data, by evaluating it for DC in three Indian languages.
Our technique also performs well in removing stuttering disfluencies in ASR transcripts introduced by speech impairments.
(2023-06-10T08:58:53Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
(2022-11-11T20:21:38Z)
- HateCheckHIn: Evaluating Hindi Hate Speech Detection Models [6.52974752091861]
Multilingual hate is a major emerging challenge for automated detection.
We introduce a set of functionalities for the purpose of evaluation.
Considering Hindi as a base language, we craft test cases for each functionality.
(2022-04-30T19:09:09Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
(2022-02-22T20:55:31Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
(2022-01-15T20:48:14Z)
- HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla [2.055204980188575]
In this paper, we present HS-BAN, a binary class hate speech dataset in Bangla language consisting of more than 50,000 labeled comments.
We explore traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection.
Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% F1-score.
(2021-12-03T13:35:18Z)
- Detecting Abusive Albanian [5.092028049119383]
Shaj is an annotated dataset for hate speech and offensive speech constructed from user-text content on various social media platforms.
The dataset is tested using three different classification models, the best of which achieves an F1 score of 0.77 for the identification of offensive language.
(2021-07-28T18:47:32Z)
- Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media [1.8899300124593648]
The prevalence of Hindi-English code-mixed data (Hinglish) is on the rise among urban populations all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish language.
(2021-05-11T10:02:28Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
(2021-03-11T06:47:45Z)
- Factorization of Fact-Checks for Low Resource Indian Languages [44.94080515860928]
We introduce FactDRIL: the first large scale multilingual Fact-checking dataset for Regional Indian languages.
Our dataset consists of 9,058 samples in English and 5,155 samples in Hindi; the remaining 8,222 samples are distributed across various regional languages.
We expect this dataset will be a valuable resource and serve as a starting point to fight proliferation of fake news in low resource languages.
(2021-02-23T16:47:41Z)
- Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network [3.0168410626760034]
We build the largest Bengali word embedding models to date based on 250 million articles, which we call BengFastText.
We incorporate word embeddings into a Multichannel Convolutional-LSTM network for predicting different types of hate speech, document classification, and sentiment analysis.
(2020-04-11T22:17:04Z)