mahaNLP: A Marathi Natural Language Processing Library
- URL: http://arxiv.org/abs/2311.02579v1
- Date: Sun, 5 Nov 2023 06:59:59 GMT
- Title: mahaNLP: A Marathi Natural Language Processing Library
- Authors: Vidula Magdum, Omkar Dhekane, Sharayu Hiwarkhedkar, Saloni Mittal,
Raviraj Joshi
- Abstract summary: We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language.
It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP.
- Score: 0.4499833362998489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present mahaNLP, an open-source natural language processing (NLP) library
specifically built for the Marathi language. It aims to enhance the support for
the low-resource Indian language Marathi in the field of NLP. It is an
easy-to-use, extensible, and modular toolkit for Marathi text analysis built on
state-of-the-art MahaBERT-based transformer models. Our work holds significant
importance, as other existing Indic NLP libraries provide only basic Marathi
processing support and rely on older models with restricted performance. Our
toolkit stands out by offering a comprehensive array of NLP tasks, encompassing
both fundamental preprocessing tasks and advanced NLP tasks like sentiment
analysis, NER, hate speech detection, and sentence completion. This paper
focuses on an overview of the mahaNLP framework, its features, and its usage.
This work is a part of the L3Cube MahaNLP initiative, more information about it
can be found at https://github.com/l3cube-pune/MarathiNLP .
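The abstract mentions fundamental preprocessing tasks alongside the advanced ones. As a hedged illustration only (this is not mahaNLP's actual API; the real entry points are documented in the GitHub repository above), a minimal sketch of the kind of Devanagari-aware cleaning and tokenization such a toolkit covers, in plain Python:

```python
import re

# Devanagari (the script used for Marathi) occupies Unicode U+0900 - U+097F.
DEVANAGARI = r"\u0900-\u097F"

def clean_marathi(text: str) -> str:
    """Keep Devanagari text; drop URLs and other non-Devanagari noise."""
    # Remove URLs, which frequently pollute scraped corpora.
    text = re.sub(r"https?://\S+", " ", text)
    # Replace anything that is neither Devanagari nor whitespace.
    text = re.sub(rf"[^{DEVANAGARI}\s]", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

def tokenize_marathi(text: str) -> list[str]:
    """Whitespace tokenization, splitting off the danda (U+0964) and
    double danda (U+0965) sentence markers as their own tokens."""
    text = re.sub(r"([।॥])", r" \1 ", text)
    return text.split()

tokens = tokenize_marathi(clean_marathi("मी मराठी शिकतो आहे। http://example.com"))
print(tokens)  # ['मी', 'मराठी', 'शिकतो', 'आहे', '।']
```

This is a deliberately simple sketch; the actual library builds on MahaBERT-based transformer models and subword tokenization rather than regex rules.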
Related papers
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes [13.585440544031584]
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala.
Our systems report state-of-the-art performance on available benchmark datasets for all tasks.
SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input.
arXiv Detail & Related papers (2023-02-19T09:58:55Z)
- L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library [1.14219428942199]
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
arXiv Detail & Related papers (2022-05-29T17:51:00Z)
- Number Entity Recognition [65.80137628972312]
Numbers are essential components of text, like any other word tokens, from which natural language processing (NLP) models are built and deployed.
In this work, we attempt to tap this potential of state-of-the-art NLP models and transfer their ability to boost performance in related tasks.
Our proposed classification of numbers into entities helps NLP models perform well on several tasks, including a handcrafted Fill-In-The-Blank (FITB) task and on question answering using joint embeddings.
arXiv Detail & Related papers (2022-05-07T05:22:43Z)
- L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models [0.7874708385247353]
We focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc.
arXiv Detail & Related papers (2022-04-12T18:32:15Z)
- textless-lib: a Library for Textless Spoken Language Processing [50.070693765984075]
We introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area.
We describe the building blocks that the library provides and demonstrate its usability.
arXiv Detail & Related papers (2022-02-15T12:39:42Z)
- "A Passage to India": Pre-trained Word Embeddings for Indian Languages [30.607474624873014]
We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We place these embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
arXiv Detail & Related papers (2021-12-27T17:31:04Z)
- A Data-Centric Framework for Composable NLP Workflows [109.51144493023533]
Empirical natural language processing systems in application domains (e.g., healthcare, finance, education) involve interoperation among multiple components.
We establish a unified open-source framework to support fast development of such sophisticated NLP in a composable manner.
arXiv Detail & Related papers (2021-03-02T16:19:44Z)
- Experimental Evaluation of Deep Learning models for Marathi Text Classification [0.0]
We evaluate CNN, LSTM, ULMFiT, and BERT based models on two publicly available Marathi text classification datasets.
We show that basic single layer models based on CNN and LSTM coupled with FastText embeddings perform on par with the BERT based models on the available datasets.
arXiv Detail & Related papers (2021-01-13T06:21:27Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework with a shared pre-trained model, which has the advantage of capturing knowledge shared across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.