L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models,
and Library
- URL: http://arxiv.org/abs/2205.14728v2
- Date: Tue, 31 May 2022 15:15:51 GMT
- Title: L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models,
and Library
- Authors: Raviraj Joshi
- Abstract summary: Despite being the third most popular language in India, the Marathi language lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
- Score: 1.14219428942199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite being the third most popular language in India, the Marathi language
lacks useful NLP resources. Moreover, popular NLP libraries do not have support
for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a
library for Marathi natural language processing. We present datasets and
transformer models for supervised tasks like sentiment analysis, named entity
recognition, and hate speech detection. We have also published a monolingual
Marathi corpus for unsupervised language modeling tasks. Overall we present
MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding
MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark
datasets and prepare useful resources for Marathi. The resources are available
at https://github.com/l3cube-pune/MarathiNLP.
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- mahaNLP: A Marathi Natural Language Processing Library [0.4499833362998489]
We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language.
It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP.
arXiv Detail & Related papers (2023-11-05T06:59:59Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models [0.7874708385247353]
We focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc.
arXiv Detail & Related papers (2022-04-12T18:32:15Z)
- L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources [1.14219428942199]
We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from different internet sources.
We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens.
We show the effectiveness of these resources on downstream classification and NER tasks.
arXiv Detail & Related papers (2022-02-02T17:35:52Z)
- Experimental Evaluation of Deep Learning models for Marathi Text Classification [0.0]
We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets.
We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets.
arXiv Detail & Related papers (2021-01-13T06:21:27Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)