Part-of-speech tagging for Nagamese Language using CRF
- URL: http://arxiv.org/abs/2509.19343v3
- Date: Mon, 13 Oct 2025 16:54:53 GMT
- Title: Part-of-speech tagging for Nagamese Language using CRF
- Authors: Alovi N Shohe, Chonglio Khiamungam, Teisovi Angami
- Abstract summary: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70%, precision and recall of 86%, and an F1-score of 85% are achieved.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper investigates part-of-speech tagging, an important task in Natural Language Processing (NLP), for the Nagamese language. The Nagamese language, a.k.a. Naga Pidgin, is an Assamese-lexified Creole language developed primarily as a means of communication in trade between the Nagas and people from Assam in northeast India. A substantial amount of work on part-of-speech tagging has been done for resource-rich languages like English, Hindi, etc. However, no work has been done for the Nagamese language. To the best of our knowledge, this is the first attempt at part-of-speech tagging for the Nagamese language. The aim of this work is to identify the part of speech of each word in a given sentence in the Nagamese language. An annotated corpus of 16,112 tokens is created, and a machine learning technique known as Conditional Random Fields (CRF) is applied to it. Using CRF, an overall tagging accuracy of 85.70%, precision and recall of 86%, and an F1-score of 85% are achieved. Keywords: Nagamese, NLP, part-of-speech, machine learning, CRF.
Related papers
- Sentiment Analysis and Emotion Classification using Machine Learning Techniques for Nagamese Language - A Low-resource Language [0.0]
The aim of this work is to detect sentiments in terms of polarity (positive, negative and neutral) and basic emotions in the Nagamese language. We build a sentiment polarity lexicon of 1,195 Nagamese words and use it to build features for supervised machine learning techniques.
arXiv Detail & Related papers (2025-12-01T04:01:29Z)
- Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages [6.74683227658822]
India has 1369 languages, with 22 official using 13 scripts. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh.
arXiv Detail & Related papers (2025-06-04T12:22:24Z)
- Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi [0.0]
The paper aims to build a platform that offers features such as text anonymization, abstractive text summarization and spell checking in the English, Hindi and Marathi languages. The aim of these tools is to serve enterprise and consumer clients who predominantly use Indian regional languages.
arXiv Detail & Related papers (2024-12-24T04:51:32Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- AsPOS: Assamese Part of Speech Tagger using Deep Learning Approach [7.252817150901275]
Part of Speech (POS) tagging is crucial to Natural Language Processing (NLP).
We present a Deep Learning (DL)-based POS tagger for Assamese.
We attain a tagging F1 score of 86.52%.
arXiv Detail & Related papers (2022-12-14T05:36:18Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation [68.8204255655161]
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Including Signed Languages in Natural Language Processing [48.62744923724317]
Signed languages are the primary means of communication for many deaf and hard of hearing individuals.
This position paper calls on the NLP community to include signed languages as a research area with high social and scientific impact.
arXiv Detail & Related papers (2021-05-11T17:37:55Z)
- BNLP: Natural language processing toolkit for Bengali language [0.0]
BNLP is an open-source language processing toolkit for the Bengali language.
It provides tokenization, word embedding, POS tagging, and NER tagging facilities.
BNLP is widely used in the Bengali research community, with 16K downloads, 119 stars and 31 forks.
arXiv Detail & Related papers (2021-01-31T07:56:08Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing these concerns.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language [32.6535407800833]
Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan.
It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance.
We started a project of automatic speech recognition (ASR) for the Ainu language in order to contribute to the development of annotated language archives.
arXiv Detail & Related papers (2020-02-16T20:44:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.