SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface
for Pedagogical and Annotation Purposes
- URL: http://arxiv.org/abs/2302.09527v2
- Date: Mon, 29 May 2023 07:36:21 GMT
- Title: SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface
for Pedagogical and Annotation Purposes
- Authors: Jivnesh Sandhan, Anshul Agarwal, Laxmidhar Behera, Tushar Sandhan and
Pawan Goyal
- Abstract summary: We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala.
Our systems report state-of-the-art performance on available benchmark datasets for all tasks.
SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input.
- Score: 13.585440544031584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a neural Sanskrit Natural Language Processing (NLP) toolkit named
SanskritShala (a school of Sanskrit) to facilitate computational linguistic
analyses for several tasks such as word segmentation, morphological tagging,
dependency parsing, and compound type identification. Our systems currently
report state-of-the-art performance on available benchmark datasets for all
tasks. SanskritShala is deployed as a web-based application, which allows a
user to get real-time analysis for the given input. It is built with
easy-to-use interactive data annotation features that allow annotators to
correct the system's predictions when it makes mistakes. We publicly release
the source code of the four modules included in the toolkit, seven word
embedding models trained on publicly available Sanskrit corpora, and multiple
annotated datasets (word similarity, relatedness, categorization, and analogy
prediction) for assessing the intrinsic properties of word embeddings. To the
best of our knowledge, this is the first neural-based Sanskrit NLP toolkit
with a web-based interface and a number of NLP modules. We believe that people
who wish to work with Sanskrit will find it useful for pedagogical and
annotation purposes. SanskritShala is available at:
https://cnerg.iitkgp.ac.in/sanskritshala. The demo video of our platform can be
accessed at: https://youtu.be/x0X31Y9k0mw4.
Related papers
- One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks [26.848664285007022]
ByT5-Sanskrit is designed for NLP applications involving the morphologically rich language Sanskrit.
It is easier to deploy and more robust to data not covered by external linguistic resources.
We show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages.
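ByT5 models operate directly on UTF-8 bytes, which is part of why such a model is easy to deploy without an external tokenizer or linguistic resources. A minimal sketch of byte-level seq2seq inference with Hugging Face transformers follows; the public google/byt5-small checkpoint is only a stand-in, since the ByT5-Sanskrit checkpoint name is not given in this summary, and the task prefix is purely hypothetical.

```python
# Minimal sketch: byte-level seq2seq inference with a ByT5-style model.
# "google/byt5-small" is a public base checkpoint used as a stand-in; the
# ByT5-Sanskrit checkpoint name and its task prefixes are not given here.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# ByT5 tokenization is essentially raw UTF-8 bytes plus special tokens, so no
# Sanskrit-specific vocabulary or segmenter is needed at deployment time.
inputs = tokenizer("segment: रामो गच्छति", return_tensors="pt")  # hypothetical prompt
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```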
arXiv Detail & Related papers (2024-09-20T22:02:26Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit [1.184066113335041]
This thesis aims to make Sanskrit manuscripts more accessible to the end-users through natural language technologies.
The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions.
We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit.
arXiv Detail & Related papers (2023-08-17T06:33:33Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches [0.0]
This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) for developing an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character-sequence features and pre-trained word vectors achieved a state-of-the-art result.
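To make the Bi-LSTM setup mentioned above concrete, here is a minimal PyTorch sketch of a sequence tagger that combines pre-trained word vectors with a character-level encoder; the layer sizes and the choice of a character LSTM are illustrative assumptions, not the paper's configuration.

```python
# Illustrative Bi-LSTM tagger: pre-trained word embeddings + character features.
# Sizes and the character encoder are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, word_vectors, num_chars, num_tags,
                 char_dim=32, char_hidden=32, word_hidden=128):
        super().__init__()
        # Pre-trained word vectors, optionally fine-tuned during training.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=False)
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, batch_first=True,
                                 bidirectional=True)
        word_dim = word_vectors.size(1) + 2 * char_hidden
        self.word_lstm = nn.LSTM(word_dim, word_hidden, batch_first=True,
                                 bidirectional=True)
        self.out = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        batch, seq_len, max_len = char_ids.shape
        chars = self.char_emb(char_ids.view(batch * seq_len, max_len))
        _, (h, _) = self.char_lstm(chars)                # h: (2, B*T, char_hidden)
        char_feats = h.transpose(0, 1).reshape(batch, seq_len, -1)
        words = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        hidden, _ = self.word_lstm(words)
        return self.out(hidden)                          # (batch, seq_len, num_tags)
```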
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
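As a rough sketch of the projection idea described above, the code below applies a language-specific linear map to token embeddings before they enter a Transformer encoder; the layer sizes and the use of PyTorch's built-in TransformerEncoder are assumptions for illustration, not the XLP paper's implementation.

```python
# Sketch of a language-specific projection applied to word embeddings before a
# Transformer, in the spirit of XLP. All details are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageProjectedEncoder(nn.Module):
    def __init__(self, vocab_size, num_languages, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # One projection matrix per language, replacing an additive language embedding.
        self.lang_proj = nn.ModuleList(
            [nn.Linear(d_model, d_model, bias=False) for _ in range(num_languages)]
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids, lang_id):
        # Project all tokens of the batch into the chosen language's semantic space.
        x = self.lang_proj[lang_id](self.token_emb(token_ids))
        return self.encoder(x)

# Hypothetical usage: encode a batch of sentences tagged as language 3.
# encoder = LanguageProjectedEncoder(vocab_size=32000, num_languages=10)
# hidden = encoder(torch.randint(0, 32000, (8, 20)), lang_id=3)
```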
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
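A compact sketch of the joint objective described above: one LSTM encoder feeds two decoders, one reconstructing the source sentence and one producing the translation, with the training loss summed over both. The architecture sizes, the single-layer setup, and the idea of reading embeddings off the encoder states are illustrative assumptions, not the paper's exact design.

```python
# Sketch: LSTM encoder with two decoders (reconstruction + translation).
# Dimensions and teacher-forced decoding are illustrative assumptions.
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.rec_decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.trans_decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.rec_out = nn.Linear(hidden, src_vocab)
        self.trans_out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; the encoder states are one plausible (assumed) place
        # to read off contextualised word embeddings afterwards.
        _, state = self.encoder(self.src_emb(src_ids))
        rec_hidden, _ = self.rec_decoder(self.src_emb(src_ids), state)
        trans_hidden, _ = self.trans_decoder(self.tgt_emb(tgt_ids), state)
        return self.rec_out(rec_hidden), self.trans_out(trans_hidden)

# Training would sum cross-entropy losses over both outputs (teacher forcing).
```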
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- An Augmented Translation Technique for low Resource language pair: Sanskrit to Hindi translation [0.0]
In this work, Zero-Shot Translation (ZST) is investigated for a low-resource language pair.
The same architecture is tested for Sanskrit-to-Hindi translation, for which data is sparse.
Dimensionality reduction of the word embeddings is performed to reduce memory usage for data storage.
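As a small illustration of the dimensionality-reduction step mentioned above, the sketch below shrinks a word-embedding matrix with PCA; the target dimensionality and the use of scikit-learn are assumptions, and the paper may use a different reduction technique.

```python
# Illustrative only: reduce an embedding matrix from d to k dimensions with PCA
# to cut storage. The target size (k=100) and PCA itself are assumptions.
import numpy as np
from sklearn.decomposition import PCA

def reduce_embeddings(embeddings: np.ndarray, k: int = 100) -> np.ndarray:
    """embeddings: (vocab_size, d) matrix; returns a (vocab_size, k) matrix."""
    return PCA(n_components=k).fit_transform(embeddings)

# Hypothetical usage: a 300-dimensional table shrunk to 100 dimensions.
# small = reduce_embeddings(np.random.randn(50000, 300), k=100)
```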
arXiv Detail & Related papers (2020-06-09T17:01:55Z)
- Neural Approaches for Data Driven Dependency Parsing in Sanskrit [19.844420181108177]
We evaluate four different data-driven machine learning models, originally proposed for different languages, and compare their performances on Sanskrit data.
We compare the performance of each of the models in a low-resource setting, with 1,500 sentences for training.
We also investigate the impact of word ordering in which the sentences are provided as input to these systems, by parsing verses and their corresponding prose order.
arXiv Detail & Related papers (2020-04-17T06:47:15Z)