L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and
BERT models
- URL: http://arxiv.org/abs/2203.13778v1
- Date: Fri, 25 Mar 2022 17:00:33 GMT
- Title: L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and
BERT models
- Authors: Abhishek Velankar, Hrushikesh Patil, Amol Gore, Shubham Salunke,
Raviraj Joshi
- Abstract summary: In India, Marathi is one of the most popular languages used by a wide audience.
In this work, we present L3Cube-MahaHate, the first major Hate Speech dataset in Marathi.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social media platforms are used by a large number of people prominently to
express their thoughts and opinions. However, these platforms have contributed
to a substantial amount of hateful and abusive content as well. Therefore, it
is important to curb the spread of hate speech on these platforms. In India,
Marathi is one of the most popular languages used by a wide audience. In this
work, we present L3Cube-MahaHate, the first major Hate Speech Dataset in
Marathi. The dataset is curated from Twitter, annotated manually. Our dataset
consists of over 25000 distinct tweets labeled into four major classes i.e
hate, offensive, profane, and not. We present the approaches used for
collecting and annotating the data and the challenges faced during the process.
Finally, we present baseline classification results using deep learning models
based on CNN, LSTM, and Transformers. We explore mono-lingual and multi-lingual
variants of BERT like MahaBERT, IndicBERT, mBERT, and xlm-RoBERTa and show that
mono-lingual models perform better than their multi-lingual counterparts. The
MahaBERT model provides the best results on L3Cube-MahaHate Corpus. The data
and models are available at https://github.com/l3cube-pune/MarathiNLP .
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Mono vs Multilingual BERT for Hate Speech Detection and Text
Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT based models provide rich representations as compared to sentence embeddings from multi-lingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z) - BERTuit: Understanding Spanish language in Twitter through a native
transformer [70.77033762320572]
We present bfBERTuit, the larger transformer proposed so far for Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Hate and Offensive Speech Detection in Hindi and Marathi [0.0]
Still hate and offensive speech detection faces a challenge due to inadequate availability of data.
In this work, we consider hate and offensive speech detection in Hindi and Marathi texts.
We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa.
We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance.
arXiv Detail & Related papers (2021-10-23T11:57:36Z) - Ceasing hate withMoH: Hate Speech Detection in Hindi-English
Code-Switched Language [2.9926023796813728]
This work focuses on analyzing hate speech in Hindi-English code-switched language.
To contain the structure of data, we developed MoH or Map Only Hindi, which means "Love" in Hindi.
MoH pipeline consists of language identification, Roman to Devanagari Hindi transliteration using a knowledge base of Roman Hindi words.
arXiv Detail & Related papers (2021-10-18T15:24:32Z) - L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset [0.0]
This paper presents the first major publicly available Marathi Sentiment Analysis dataset - L3MahaSent.
It is curated using tweets extracted from various Maharashtrian personalities' Twitter accounts.
Our dataset consists of 16,000 distinct tweets classified in three broad classes viz. positive, negative, and neutral.
arXiv Detail & Related papers (2021-03-21T14:22:13Z) - Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language.
The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset.
We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
arXiv Detail & Related papers (2021-01-11T19:10:57Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.