Hate and Offensive Speech Detection in Hindi and Marathi
- URL: http://arxiv.org/abs/2110.12200v1
- Date: Sat, 23 Oct 2021 11:57:36 GMT
- Title: Hate and Offensive Speech Detection in Hindi and Marathi
- Authors: Abhishek Velankar, Hrushikesh Patil, Amol Gore, Shubham Salunke,
Raviraj Joshi
- Abstract summary: However, hate and offensive speech detection still faces a challenge due to the inadequate availability of data.
In this work, we consider hate and offensive speech detection in Hindi and Marathi texts.
We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa.
We show that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis is the most basic NLP task for determining the polarity of
text data. There has been a significant amount of work on multilingual text as
well. However, hate and offensive speech detection still faces a challenge due
to the inadequate availability of data, especially for Indian languages like
Hindi and Marathi. In this work, we consider hate and offensive speech
detection in Hindi and Marathi texts. The problem is formulated as a text
classification task using state-of-the-art deep learning approaches. We explore
different deep learning architectures like CNN, LSTM, and variations of BERT
like multilingual BERT, IndicBERT, and monolingual RoBERTa. The basic models
based on CNN and LSTM are augmented with FastText word embeddings. We use the
HASOC 2021 Hindi and Marathi hate speech datasets to compare these algorithms.
The Marathi dataset consists of binary labels, and the Hindi dataset consists
of binary as well as more fine-grained labels. We show that the
transformer-based models perform the best and that even the basic models,
along with FastText embeddings, give competitive performance. Moreover, with
normal hyper-parameter tuning, the basic models perform better than the
BERT-based models on the fine-grained Hindi dataset.
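To make the transformer-based setup concrete, the sketch below fine-tunes multilingual BERT on a binary hate/offensive classification task with the Hugging Face Trainer. The file names, column names, and hyper-parameters are illustrative assumptions, not the authors' exact configuration; the CNN/LSTM baselines from the abstract would instead pair FastText-initialized embedding layers with a conventional classifier.

```python
# Minimal sketch, assuming a HASOC-style CSV with "text" and "label" columns
# (paths, column names, and hyper-parameters are hypothetical, not taken from
# the paper).
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # other checkpoints plug in here


class HateSpeechDataset(Dataset):
    """Tokenized tweets with binary labels (0 = not offensive, 1 = offensive)."""

    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.encodings = tokenizer(list(texts), truncation=True,
                                   padding="max_length", max_length=max_len)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item


def main():
    train_df = pd.read_csv("hasoc2021_train.csv")  # hypothetical file
    dev_df = pd.read_csv("hasoc2021_dev.csv")      # hypothetical file

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                               num_labels=2)

    train_ds = HateSpeechDataset(train_df["text"], train_df["label"], tokenizer)
    dev_ds = HateSpeechDataset(dev_df["text"], dev_df["label"], tokenizer)

    args = TrainingArguments(output_dir="mbert-hasoc",
                             num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)

    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=dev_ds)
    trainer.train()
    print(trainer.evaluate())


if __name__ == "__main__":
    main()
```

Swapping MODEL_NAME for an IndicBERT or a Hindi/Marathi monolingual RoBERTa checkpoint gives the other transformer variants compared in the abstract.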
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787] (2024-06-16)
  We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
  We present a new end-to-end model architecture, COSTA, that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
  COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643] (2024-05-19)
  There is a significant gap in low-resource languages, especially Swahili.
  Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
  We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
- IndiText Boost: Text Augmentation for Low Resource India Languages [0.0] (2024-01-23)
  We focus on implementing techniques like Easy Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for text classification on different languages.
  To the best of our knowledge, no such work exists for text augmentation on Indian languages.
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792] (2023-09-22)
  We propose a framework that approaches data augmentation based on deepfake audio.
  A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262] (2023-03-20)
  We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T).
  We first propose a strategy to generate code-switching text data and then investigate injecting generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
  Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136] (2022-04-19)
  We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis, and simple text classification in Marathi.
  We use standard multilingual models such as mBERT, indicBERT, and xlm-RoBERTa and compare with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi.
  We show that monolingual MahaBERT-based models provide rich representations as compared to sentence embeddings from multilingual counterparts.
- L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models [0.7874708385247353] (2022-03-25)
  In India, Marathi is one of the most popular languages used by a wide audience.
  In this work, we present L3Cube-MahaHate, the first major hate speech dataset in Marathi.
- Experimental Evaluation of Deep Learning models for Marathi Text Classification [0.0] (2021-01-13)
  We evaluate CNN, LSTM, ULMFiT, and BERT-based models on two publicly available Marathi text classification datasets.
  We show that basic single-layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT-based models on the available datasets.
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403] (2020-11-16)
  This paper describes Facebook AI's submission to the WMT20 shared news translation task.
  We focus on the low-resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
  We approach the low-resource problem using two main strategies: leveraging all available data and adapting the system to the target news domain.
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044] (2020-09-21)
  COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
  The key idea is to generate the source transcript and the target translation text with a single decoder.
  Our method is verified on three mainstream datasets.
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723] (2020-01-19)
  Research on the classification of the morphologically rich and low-resource Hindi language, written in the Devanagari script, has been limited due to the absence of large labeled corpora.
  In this work, we used translated versions of English datasets to evaluate models based on CNN, LSTM, and Attention.
  The paper also serves as a tutorial for popular text classification techniques.
This list is automatically generated from the titles and abstracts of the papers on this site.