Part-of-Speech Tagging of Odia Language Using statistical and Deep
Learning-Based Approaches
- URL: http://arxiv.org/abs/2207.03256v1
- Date: Thu, 7 Jul 2022 12:15:23 GMT
- Title: Part-of-Speech Tagging of Odia Language Using statistical and Deep
Learning-Based Approaches
- Authors: Tusarkanta Dalai, Tapas Kumar Mishra and Pankaj K Sa
- Abstract summary: This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
The Bi-LSTM model with character sequence features and pre-trained word vectors achieved a state-of-the-art result.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Part-of-speech (POS) tagging is a preprocessing step of many
natural language processing (NLP) tasks such as named entity recognition (NER),
speech processing, information extraction, word sense disambiguation, and
machine translation. POS tagging has already achieved promising results for
English and European languages, but for Indian languages, particularly Odia, it
is not yet well explored because of the lack of supporting tools and resources
and the morphological richness of the language. Unfortunately, we were unable to locate
an open source POS tagger for Odia, and only a handful of attempts have been
made to develop POS taggers for the Odia language. The main contribution of this
work is to present conditional random field (CRF) and deep
learning-based approaches (CNN and bidirectional long short-term memory, Bi-LSTM) to
develop an Odia part-of-speech tagger. We used a publicly accessible corpus
whose dataset is annotated with the Bureau of Indian Standards (BIS) tagset.
However, most languages around the globe use datasets annotated with the
Universal Dependencies (UD) tagset. Hence, to maintain uniformity, the Odia
dataset should use the same tagset, so we constructed a simple mapping
from the BIS tagset to the UD tagset. We experimented with various feature-set
inputs to the CRF model and observed the impact of each constructed feature set. The deep
learning-based models combine a Bi-LSTM network, a CNN network, a CRF layer,
character sequence information, and pre-trained word vectors. Character sequence
information was extracted using a convolutional neural network (CNN) and a
Bi-LSTM network. Six different combinations of neural sequence labelling models
were implemented, and their performance measures were investigated. The Bi-LSTM
model with character sequence features and pre-trained word vectors achieved a
state-of-the-art result.
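The two preprocessing pieces the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the BIS tags and their UD targets below are a hypothetical subset (the paper does not list its full mapping table here), and the feature function shows the kind of hand-crafted, suffix-heavy features commonly fed to a CRF tagger for a morphologically rich language.

```python
# Illustrative sketch only: the BIS -> UD pairs below are example entries,
# not the paper's complete mapping table.
BIS_TO_UD = {
    "N_NN": "NOUN",     # common noun
    "N_NNP": "PROPN",   # proper noun
    "V_VM": "VERB",     # main verb
    "JJ": "ADJ",        # adjective
    "RB": "ADV",        # adverb
    "PSP": "ADP",       # postposition
    "RD_PUNC": "PUNCT", # punctuation
}

def map_bis_to_ud(tagged_sentence):
    """Convert a list of (word, BIS-tag) pairs to (word, UD-tag) pairs.

    Unmapped tags fall back to UD's catch-all 'X'.
    """
    return [(word, BIS_TO_UD.get(tag, "X")) for word, tag in tagged_sentence]

def word_features(sentence, i):
    """Hand-crafted features for token i, of the kind a CRF tagger consumes."""
    word = sentence[i]
    return {
        "word": word,
        "suffix3": word[-3:],   # suffixes carry morphology in Odia
        "prefix2": word[:2],
        "is_digit": word.isdigit(),
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }
```

A CRF toolkit such as sklearn-crfsuite accepts one such feature dict per token; varying which keys are included is how feature-set experiments of the kind the abstract mentions are typically run.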
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- Exploring transfer learning for Deep NLP systems on rarely annotated languages [0.0]
This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali.
We assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy.
arXiv Detail & Related papers (2024-10-15T13:33:54Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface for Pedagogical and Annotation Purposes [13.585440544031584]
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala.
Our systems report state-of-the-art performance on available benchmark datasets for all tasks.
SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input.
arXiv Detail & Related papers (2023-02-19T09:58:55Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Transferring Knowledge Distillation for Multilingual Social Event Detection [42.663309895263666]
Recently published graph neural networks (GNNs) show promising performance at social event detection tasks.
We present a GNN that incorporates cross-lingual word embeddings for detecting events in multilingual data streams.
Experiments on both synthetic and real-world datasets show the framework to be highly effective at detection in both multilingual data and in languages where training samples are scarce.
arXiv Detail & Related papers (2021-08-06T12:38:42Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- Towards Relevance and Sequence Modeling in Language Recognition [39.547398348702025]
We propose a neural network framework utilizing short-sequence information in language recognition.
A new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task.
Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data.
arXiv Detail & Related papers (2020-04-02T18:31:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.