Colloquial Persian POS (CPPOS) Corpus: A Novel Corpus for Colloquial
Persian Part of Speech Tagging
- URL: http://arxiv.org/abs/2310.00572v1
- Date: Sun, 1 Oct 2023 05:06:33 GMT
- Authors: Leyla Rabiei, Farzaneh Rahmani, Mohammad Khansari, Zeinab Rajabi,
Moein Salimi
- Abstract summary: This paper introduces a novel corpus, "Colloquial Persian POS" (CPPOS), specifically designed to support colloquial Persian text.
The corpus includes formal and informal text collected from various domains such as political, social, and commercial on Telegram, Twitter, and Instagram.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Introduction: Part-of-Speech (POS) Tagging, the process of classifying words
into their respective parts of speech (e.g., verb or noun), is essential in
various natural language processing applications. POS tagging is a crucial
preprocessing task for applications like machine translation, question
answering, sentiment analysis, etc. However, existing corpora for POS tagging
in Persian mainly consist of formal texts, such as daily news and newspapers.
As a result, smart POS tools, machine learning models, and deep learning models
trained on these corpora may not perform optimally for processing colloquial
text in social network analysis. Method: This paper introduces a novel corpus,
"Colloquial Persian POS" (CPPOS), specifically designed to support colloquial
Persian text. The corpus contains more than 520K labeled tokens of formal and
informal text collected from Telegram, Twitter, and Instagram across various
domains, such as political, social, and commercial. After collecting posts from these
social platforms for one year, special preprocessing steps were conducted,
including normalization, sentence tokenizing, and word tokenizing for social
text. The tokens and sentences were then manually annotated and verified by a
team of linguistic experts. This study also defines a POS tagging guideline for
annotating the data and conducting the annotation process. Results: To evaluate
the quality of CPPOS, various deep learning models, such as the RNN family,
were trained using the constructed corpus. A comparison with another well-known
Persian POS corpus named "Bijankhan" and the Persian Hazm POS tool trained on
Bijankhan revealed that our model trained on CPPOS outperforms them. With the
new corpus and the BiLSTM deep neural model, we achieved a 14% improvement over
the previous dataset.
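
The preprocessing steps named in the abstract (normalization, sentence tokenizing, word tokenizing) can be illustrated with a minimal Python sketch. The abstract does not specify the actual rules, so everything here is an assumption: the Arabic-to-Persian character mapping, the punctuation-based sentence splitter, and the function names `normalize`, `split_sentences`, `tokenize`, and `preprocess` are hypothetical and are not the authors' pipeline.

```python
import re
import unicodedata

# Assumed mapping: Arabic code points that are often typed in place of
# their Persian counterparts in social media text.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})

# Split after '.', '!', '?' or the Arabic question mark (U+061F)
# when followed by whitespace. This is a simplification.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")

# A token is either a run of word characters, optionally joined by the
# zero-width non-joiner (U+200C) used inside Persian words, or a single
# non-space symbol such as a punctuation mark.
TOKEN = re.compile(r"\w+(?:\u200c\w+)*|[^\w\s]")


def normalize(text: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split normalized text into sentences at end-of-sentence punctuation."""
    return [s for s in SENTENCE_END.split(text) if s]


def tokenize(sentence: str) -> list[str]:
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


def preprocess(text: str) -> list[list[str]]:
    """Run normalization, sentence splitting, and word tokenization in order."""
    return [tokenize(s) for s in split_sentences(normalize(text))]
```

Calling `preprocess` on a raw post returns one token list per sentence, which is the shape a per-token POS annotation step would consume. A real pipeline for social text would likely also need rules for emojis, hashtags, mentions, URLs, and diacritics, none of which this sketch handles.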