Accuracy of the Uzbek stop words detection: a case study on "School
corpus"
- URL: http://arxiv.org/abs/2209.07053v1
- Date: Thu, 15 Sep 2022 05:14:31 GMT
- Title: Accuracy of the Uzbek stop words detection: a case study on "School
corpus"
- Authors: Khabibulla Madatov, Shukurla Bekchanov, Jernej Vičič
- Abstract summary: We present a method to evaluate the quality of automatically generated stop-word lists.
The method was tested on an automatically-generated list of stop words for the Uzbek language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stop words are very important for information retrieval and text
analysis tasks in natural language processing. The current work presents a
method to evaluate the quality of automatically generated stop-word lists.
Although the method proposed in this paper was tested on an
automatically generated list of stop words for the Uzbek language, it can,
with some modifications, be applied to similar languages, either from the same
family or others with an agglutinative nature. Since Uzbek is an agglutinative
language, the automatic detection of its stop words is a more complex process
than in inflected languages. Moreover, we integrated our previous work on stop
word detection, using the "School corpus" as an example, by investigating how
to automatically analyse the detection of stop words in Uzbek texts. This work
addresses two questions: whether there is a good way to evaluate the available
stop words for Uzbek texts, and whether it is possible to determine which part
of an Uzbek sentence contains the majority of the stop words by studying the
numerical characteristics of the probabilities of unique words. The results
show acceptable accuracy of the stop-word lists.
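The abstract's idea of studying the numerical characteristics of unique-word probabilities can be illustrated with a small sketch: estimate unigram probabilities from a corpus and take the highest-probability words as stop-word candidates. The function name, the fraction parameter, and the toy Uzbek sentences below are illustrative assumptions, not the paper's actual method or data.

```python
from collections import Counter

def stopword_candidates(texts, top_frac=0.01):
    """Rank unique words by unigram probability and return the top
    fraction as stop-word candidates. A simplified sketch of the
    probability-based view described in the abstract; the paper's
    actual thresholds and statistics may differ."""
    # Count every whitespace-separated token across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    # Sort unique words by raw frequency, most frequent first.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    # Report each candidate with its estimated unigram probability.
    return [(word, count / total) for word, count in ranked[:k]]

# Toy example sentences (hypothetical, for illustration only).
texts = [
    "va bu matn uchun misol",
    "bu matn ham misol va yana matn",
]
print(stopword_candidates(texts, top_frac=0.2))  # → [('matn', 0.25)]
```

On a real corpus one would inspect the full probability distribution (not just the head) to pick a cutoff, which is where the evaluation question raised by the paper becomes non-trivial.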
Related papers
- Curating Stopwords in Marathi: A TF-IDF Approach for Improved Text Analysis and Information Retrieval [0.4499833362998489]
Stopwords are commonly used words in a language that are considered to be of little value in determining the meaning or significance of a document.
Our work targets the curation of stopwords in the Marathi language using the MahaCorpus, with 24.8 million sentences.
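A common TF-IDF-style heuristic for stopword curation like the one this paper describes is to flag words with very low inverse document frequency, i.e. words that occur in almost every document. The threshold, tokenization, and example documents below are illustrative assumptions, not the Marathi paper's actual pipeline.

```python
import math
from collections import Counter

def low_idf_words(documents, max_idf=0.5):
    """Return words whose inverse document frequency is at or below a
    threshold, treating them as stop-word candidates. A minimal sketch
    of the TF-IDF approach, not the paper's exact procedure."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    # Low IDF means the word is spread across most documents.
    return sorted(
        word for word, count in df.items()
        if math.log(n_docs / count) <= max_idf
    )

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "a bird flew over the house",
]
print(low_idf_words(docs))  # → ['cat', 'the']
```

At the scale of 24.8 million sentences, one would also weight by term frequency and inspect the resulting ranking manually, which is the curation step the paper emphasizes.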
arXiv Detail & Related papers (2024-06-16T17:59:05Z) - "You should probably read this": Hedge Detection in Text [8.890331069484203]
Humans express ideas, beliefs, and statements through language.
In this work, we apply a joint model that leverages words and part-of-speech tags to improve hedge detection in text and achieve a new top score on the CoNLL-2010 Wikipedia corpus.
arXiv Detail & Related papers (2024-05-22T03:25:35Z) - Text Categorization Can Enhance Domain-Agnostic Stopword Extraction [3.6048839315645442]
This paper investigates the role of text categorization in streamlining stopword extraction in natural language processing (NLP).
By leveraging the MasakhaNEWS, African Stopwords Project, and MasakhaPOS datasets, our findings emphasize that text categorization effectively identifies domain-agnostic stopwords with over 80% detection success rate for most examined languages.
arXiv Detail & Related papers (2024-01-24T11:52:05Z) - Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z) - UzbekTagger: The rule-based POS tagger for Uzbek language [0.0]
This research paper presents a part-of-speech annotated dataset and tagger tool for the low-resource Uzbek language.
The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool.
The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool created can also be used as a pivot to use as a base for other closely-related Turkic languages.
arXiv Detail & Related papers (2023-01-30T07:40:45Z) - Uzbek affix finite state machine for stemming [0.0]
The proposed methodology is a morphological analysis of Uzbek words that uses affixes to find the root, without any lexicon.
This method performs morphological analysis of words from large amounts of text at high speed, and it does not require memory for storing a vocabulary.
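The affix-stripping idea behind such a stemmer can be sketched as repeatedly removing the longest matching suffix until no suffix applies. The handful of Uzbek suffixes and the minimum-root constraint below are illustrative assumptions; the paper's finite-state machine encodes a much larger, ordered affix inventory.

```python
def strip_suffixes(word, suffixes, min_root=2):
    """Repeatedly strip the longest matching suffix from a word,
    a simplified, lexicon-free view of finite-state stemming.
    The suffix list is a small sample, not the paper's inventory."""
    # Try longer suffixes first so "lar" wins over a shorter match.
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Only strip if a plausible root (min_root chars) remains.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

# A few common Uzbek suffixes (plural, possessive, case), for illustration.
SUFFIXES = ["lar", "im", "ing", "ni", "da", "dan", "ga"]
print(strip_suffixes("kitoblarimda", SUFFIXES))  # → kitob
```

Here "kitoblarimda" ("in my books") loses the locative "da", possessive "im", and plural "lar" in turn, leaving the root "kitob" ("book"); a real finite-state machine would additionally enforce the legal ordering of affix classes.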
arXiv Detail & Related papers (2022-05-20T10:46:53Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Fake it Till You Make it: Self-Supervised Semantic Shifts for
Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text
Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations.
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models fitted to the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.