Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media
- URL: http://arxiv.org/abs/2105.04913v1
- Date: Tue, 11 May 2021 10:02:28 GMT
- Title: Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media
- Authors: Ananya Srivastava, Mohammed Hasan, Bhargav Yagnik, Rahee Walambe and
Ketan Kotecha
- Abstract summary: Hindi-English code-mixed data (Hinglish) is increasingly prevalent among urban populations around the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of hate speech in unstructured code-mixed Hinglish text.
- Score: 1.8899300124593648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social networking platforms provide a conduit to disseminate our ideas, views
and thoughts and proliferate information. This has led to the amalgamation of
English with natively spoken languages. Hindi-English code-mixed
data (Hinglish) is increasingly prevalent among urban populations around the
world. Hate speech detection algorithms deployed by most social networking
platforms are unable to filter out offensive and abusive content posted in
these code-mixed languages. Thus, the worldwide hate speech detection rate of
around 44% drops even further for content in Indian colloquial
languages and slang. In this paper, we propose a methodology for efficient
detection of hate speech in unstructured code-mixed Hinglish text. Fine-tuning-based
approaches for Hindi-English code-mixed language are employed, using
contextual embeddings such as ELMo (Embeddings from Language Models),
FLAIR, and the transformer-based BERT (Bidirectional Encoder Representations from
Transformers). Our proposed approach is compared against pre-existing
methods across various datasets. Our model outperforms
the other methods and frameworks.
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Countering Malicious Content Moderation Evasion in Online Social
Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- BERTuit: Understanding Spanish language in Twitter through a native
transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- A New Generation of Perspective API: Efficient Multilingual
Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- AtteSTNet -- An attention and subword tokenization based approach for
code-switched text hate speech detection [1.3190581566723918]
Language used in social media is often a combination of English and the native language in the region.
In India, Hindi is used predominantly and is often code-switched with English, giving rise to the Hinglish (Hindi+English) language.
arXiv Detail & Related papers (2021-12-10T20:01:44Z)
- Ceasing hate with MoH: Hate Speech Detection in Hindi-English
Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
The MoH pipeline consists of language identification and Roman-to-Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z)
- A Unified System for Aggression Identification in English Code-Mixed and
Uni-Lingual Texts [25.15521897068512]
We introduce a unified and robust deep learning architecture which works for English code-mixed dataset and uni-lingual English dataset.
The devised system uses psycho-linguistic features and very basic linguistic features.
Our proposed system outperforms all the previous approaches on English code-mixed dataset and uni-lingual English dataset.
arXiv Detail & Related papers (2020-01-15T17:06:29Z)
- "Hinglish" Language -- Modeling a Messy Code-Mixed Language [0.0]
This project focuses on using deep learning techniques to tackle a classification problem in categorizing social content written in Hindi-English into Abusive, Hate-Inducing and Not offensive categories.
We utilize bi-directional sequence models with easy text augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion.
arXiv Detail & Related papers (2019-12-30T23:01:28Z)
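The last entry above names four easy text augmentation operations: synonym replacement, random insertion, random swap, and random deletion. The following is a minimal Python sketch of what such operations might look like on a tokenized sentence. It is not the cited authors' implementation: the function names, parameters, and the tiny synonym table are assumptions made for illustration only.

```python
import random


def synonym_replacement(tokens, n_replace, synonyms, rng):
    """Replace up to n_replace tokens that have an entry in `synonyms`."""
    out = list(tokens)
    positions = [i for i, t in enumerate(out) if synonyms.get(t)]
    rng.shuffle(positions)
    for i in positions[:n_replace]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(tokens, n_inserts, synonyms, rng):
    """Insert a synonym of a randomly chosen token at a random position."""
    out = list(tokens)
    candidates = [t for t in tokens if synonyms.get(t)]
    if not candidates:
        return out
    for _ in range(n_inserts):
        word = rng.choice(candidates)
        # randint is inclusive, so the synonym may also be appended at the end.
        out.insert(rng.randint(0, len(out)), rng.choice(synonyms[word]))
    return out


def random_swap(tokens, n_swaps, rng):
    """Swap two distinct randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical Hinglish example ("this movie is very good") and synonym table.
SAMPLE_TOKENS = "yeh movie bahut acchi hai".split()
SAMPLE_SYNONYMS = {"movie": ["film"], "acchi": ["badhiya"]}
```

Passing a seeded `random.Random` instance as `rng` keeps the augmented samples reproducible across runs, which helps when comparing classifiers trained on the augmented data.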