Mono vs Multilingual BERT for Hate Speech Detection and Text
Classification: A Case Study in Marathi
- URL: http://arxiv.org/abs/2204.08669v1
- Date: Tue, 19 Apr 2022 05:07:58 GMT
- Title: Mono vs Multilingual BERT for Hate Speech Detection and Text
Classification: A Case Study in Marathi
- Authors: Abhishek Velankar, Hrushikesh Patil, Raviraj Joshi
- Abstract summary: We focus on the Marathi language and evaluate the models on Marathi datasets for hate speech detection, sentiment analysis, and simple text classification.
We use standard multilingual models such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts.
- Score: 0.966840768820136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are the most prominent architectures used for a vast
range of Natural Language Processing tasks. These models are pre-trained on
large text corpora and deliver state-of-the-art results on downstream tasks
such as text classification. In this work, we conduct a comparative study of
monolingual and multilingual BERT models. We focus on the Marathi language and
evaluate the models on Marathi datasets for hate speech detection, sentiment
analysis, and simple text classification. We use standard multilingual models
such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT,
MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi. We show that
the Marathi monolingual models outperform the multilingual BERT variants in
five different downstream fine-tuning experiments. We also evaluate sentence
embeddings from these models by freezing the BERT encoder layers, and show
that monolingual MahaBERT-based models provide richer representations than
sentence embeddings from their multilingual counterparts. However, we observe
that these embeddings are not generic enough and do not transfer well to
out-of-domain social media datasets. We consider two Marathi hate speech
datasets (L3Cube-MahaHate and HASOC-2021), a Marathi sentiment classification
dataset (L3Cube-MahaSent), and Marathi headline and article classification
datasets.
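A minimal sketch (not the authors' code) of the frozen-encoder evaluation:
load a pre-trained BERT encoder, keep its weights fixed, and mean-pool the
last hidden states into a sentence embedding. The Hugging Face Hub id
"l3cube-pune/marathi-bert" for MahaBERT and the mean-pooling choice are
assumptions for illustration.

    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL_ID = "l3cube-pune/marathi-bert"  # assumed Hub id for MahaBERT

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()  # frozen encoder: inference only, no fine-tuning

    def sentence_embedding(text: str) -> torch.Tensor:
        """Masked mean pooling of the last hidden states."""
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        hidden = outputs.last_hidden_state             # (1, seq_len, dim)
        mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    emb = sentence_embedding("ही एक चाचणी आहे.")  # Marathi: "This is a test."
    print(emb.shape)  # e.g. torch.Size([1, 768])

Swapping MODEL_ID for a multilingual checkpoint such as
"bert-base-multilingual-cased" gives the other side of the monolingual vs
multilingual comparison; a lightweight classifier is then trained on these
fixed vectors.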
Related papers
- mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus [52.83121058429025]
We introduce mOSCAR, the first large-scale multilingual and multimodal document corpus crawled from the web.
It covers 163 languages, 315M documents, 214B tokens and 1.2B images.
A model trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks.
arXiv Detail & Related papers (2024-06-13T00:13:32Z)
- L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi [0.4194295877935868]
We introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on News headlines and articles.
This corpus stands out as the largest supervised Marathi corpus, containing over 1.05 lakh (105,000) records classified into 12 diverse categories.
To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short text, long documents, and medium paragraphs.
arXiv Detail & Related papers (2024-04-28T15:20:45Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Distilling Efficient Language-Specific Models for Cross-Lingual Transfer [75.32131584449786]
Massively multilingual Transformers (MMTs) are widely used for cross-lingual transfer learning.
MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost.
We propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer.
arXiv Detail & Related papers (2023-06-02T17:31:52Z)
- L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi [0.7874708385247353]
This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation.
We show that NLI pre-training followed by STSb fine-tuning is an effective strategy for building high-performance sentence-similarity models for Hindi and Marathi; a sketch of this two-stage recipe follows below.
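The two-stage recipe can be sketched with the sentence-transformers library;
the base model id, toy data, and hyperparameters below are illustrative
assumptions, not the authors' exact setup.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("l3cube-pune/marathi-bert")  # assumed base encoder

    # Stage 1: NLI pre-training with a softmax classification objective.
    nli_data = [InputExample(texts=["premise ...", "hypothesis ..."], label=0)]
    nli_loader = DataLoader(nli_data, shuffle=True, batch_size=16)
    nli_loss = losses.SoftmaxLoss(
        model=model,
        sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
        num_labels=3,  # entailment / neutral / contradiction
    )
    model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1)

    # Stage 2: STSb fine-tuning with a cosine-similarity regression objective.
    sts_data = [InputExample(texts=["sentence a", "sentence b"], label=0.8)]
    sts_loader = DataLoader(sts_data, shuffle=True, batch_size=16)
    sts_loss = losses.CosineSimilarityLoss(model=model)
    model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=1)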
arXiv Detail & Related papers (2022-11-21T05:15:48Z)
- L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models [0.7874708385247353]
Marathi is one of the most widely spoken languages in India.
In this work, we present L3Cube-MahaHate, the first major Hate Speech dataset in Marathi.
arXiv Detail & Related papers (2022-03-25T17:00:33Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Hate and Offensive Speech Detection in Hindi and Marathi [0.0]
Hate and offensive speech detection still faces a challenge due to the inadequate availability of data.
In this work, we consider hate and offensive speech detection in Hindi and Marathi texts.
We explore different deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa.
We show that the transformer-based models perform best, and that even the basic models coupled with FastText embeddings give competitive performance.
arXiv Detail & Related papers (2021-10-23T11:57:36Z)
- Experimental Evaluation of Deep Learning models for Marathi Text Classification [0.0]
We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets.
We show that basic single-layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT-based models on the available datasets.
arXiv Detail & Related papers (2021-01-13T06:21:27Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)