A Review of the Marathi Natural Language Processing
- URL: http://arxiv.org/abs/2412.15471v2
- Date: Tue, 24 Dec 2024 13:33:51 GMT
- Title: A Review of the Marathi Natural Language Processing
- Authors: Asang Dani, Shailesh R Sathe
- Abstract summary: This paper presents a broad overview of the evolution of NLP research in Indic languages.
It focuses on Marathi and the state-of-the-art resources and tools available to the research community.
- Abstract: Marathi is one of the most widely used languages in the world. One might expect that the latest advances in NLP research in languages like English would reach such a large community. However, NLP advancements in English didn't immediately reach Indian languages like Marathi. There were several reasons for this, including the diversity of scripts used and the lack of (publicly available) resources like tokenization strategies, high-quality datasets & benchmarks, and evaluation metrics. In addition, the morphologically rich nature of Marathi made NLP tasks challenging. Advances in Neural Network (NN) based models and tools since the early 2000s helped improve this situation and make NLP research more accessible. In the past 10 years, significant efforts were made to improve language resources for all 22 scheduled languages of India. This paper presents a broad overview of the evolution of NLP research in Indic languages, with a focus on Marathi and the state-of-the-art resources and tools available to the research community. It also provides an overview of tools & techniques associated with Marathi NLP tasks.
Related papers
- Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi [0.0]
The paper aims to build a platform that lets users apply features such as text anonymization, abstractive text summarization, and spell checking in English, Hindi, and Marathi.
These tools are intended to serve enterprise and consumer clients who predominantly use Indian regional languages.
arXiv Detail & Related papers (2024-12-24T04:51:32Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Decoding the Diversity: A Review of the Indic AI Research Landscape [0.7864304771129751]
Indic languages are those spoken in the Indian subcontinent, including India, Pakistan, Bangladesh, Sri Lanka, Nepal, and Bhutan.
This review paper provides a comprehensive overview of large language model (LLM) research directions within Indic languages.
arXiv Detail & Related papers (2024-06-13T19:55:20Z)
- DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- mahaNLP: A Marathi Natural Language Processing Library [0.4499833362998489]
We present mahaNLP, an open-source natural language processing (NLP) library specifically built for the Marathi language.
It aims to enhance the support for the low-resource Indian language Marathi in the field of NLP.
arXiv Detail & Related papers (2023-11-05T06:59:59Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library [1.14219428942199]
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
arXiv Detail & Related papers (2022-05-29T17:51:00Z)
- "A Passage to India": Pre-trained Word Embeddings for Indian Languages [30.607474624873014]
We use various existing approaches to create multiple word embeddings for 14 Indian languages.
We place these embeddings for all these languages in a single repository.
We release a total of 436 models using 8 different approaches.
arXiv Detail & Related papers (2021-12-27T17:31:04Z)
- SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
arXiv Detail & Related papers (2020-06-20T13:24:14Z)