Development of Word Embeddings for Uzbek Language
- URL: http://arxiv.org/abs/2009.14384v1
- Date: Wed, 30 Sep 2020 01:52:00 GMT
- Title: Development of Word Embeddings for Uzbek Language
- Authors: B. Mansurov and A. Mansurov
- Abstract summary: We share the process of developing word embeddings for the Cyrillic variant of the Uzbek language.
The result is the first publicly available set of word vectors trained with the word2vec, GloVe, and fastText algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we share the process of developing word embeddings for the Cyrillic variant of the Uzbek language. The result of our work is the first publicly available set of word vectors trained with the word2vec, GloVe, and fastText algorithms on a high-quality web-crawl corpus developed in-house. The developed word embeddings can be used in many downstream natural language processing tasks.
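The abstract does not spell out the training pipeline or hyperparameters, so the following is only a minimal sketch of how such static embeddings are commonly trained, using gensim on a hypothetical tokenized corpus file; the corpus path and parameter values are illustrative assumptions, not the authors' settings, and GloVe is omitted because gensim has no GloVe trainer (the reference Stanford implementation works from co-occurrence counts instead).

```python
# Minimal sketch: training word2vec and fastText vectors on a tokenized
# Uzbek (Cyrillic) corpus with gensim. Paths and hyperparameters are
# illustrative assumptions, not the settings used in the paper.
from gensim.models import Word2Vec, FastText
from gensim.models.word2vec import LineSentence

# One sentence per line, tokens separated by whitespace (hypothetical file).
corpus = LineSentence("uzbek_cyrillic_corpus.txt")

w2v = Word2Vec(
    sentences=corpus,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore rare tokens
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,
    epochs=5,
)
w2v.wv.save("uzbek_word2vec.kv")

# fastText additionally learns character n-gram vectors, which helps with
# the rich morphology of Uzbek (out-of-vocabulary words still get vectors).
ft = FastText(
    sentences=corpus,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,
    min_n=3, max_n=6,  # character n-gram range
    workers=4,
    epochs=5,
)
ft.wv.save("uzbek_fasttext.kv")

# Quick sanity check: nearest neighbours of a frequent word (if in vocabulary).
print(w2v.wv.most_similar("тил", topn=5))
```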
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Creating a morphological and syntactic tagged corpus for the Uzbek language [0.0]
We develop a novel part-of-speech (POS) and syntactic tagset for building a morphologically and syntactically tagged corpus of the Uzbek language.
Based on the developed annotation tool and software, we share the results of the first stage of tagged corpus creation.
arXiv Detail & Related papers (2022-10-27T07:44:12Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Deconstructing word embedding algorithms [17.797952730495453]
We propose a retrospective on some of the most well-known word embedding algorithms.
In this work, we deconstruct Word2vec, GloVe, and others into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings. A sketch of one widely cited version of such a shared form follows this entry.
arXiv Detail & Related papers (2020-11-12T14:23:35Z)
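The summary above does not state what the common form is. One widely cited connection in the literature (Levy and Goldberg, 2014) is that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, while GloVe fits a weighted factorization of log co-occurrence counts; the sketch below shows that shared "dot product approximates a log co-occurrence statistic" view as an illustration, not necessarily the cited paper's exact formulation.

```latex
% Shared view: word vector w_i and context vector c_j are fit so that their
% dot product approximates a log co-occurrence statistic.
% Skip-gram with negative sampling (SGNS), with k negative samples:
\mathbf{w}_i^{\top}\mathbf{c}_j \;\approx\; \mathrm{PMI}(i,j) - \log k,
\qquad \mathrm{PMI}(i,j) = \log\frac{P(i,j)}{P(i)\,P(j)}
% GloVe: weighted least squares on log counts X_{ij}, with biases b_i, \tilde{b}_j:
J \;=\; \sum_{i,j} f(X_{ij})\,\bigl(\mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
```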
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence; a rough architectural sketch follows this entry.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
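The entry above only names the architecture at a high level. The sketch below is a rough, simplified PyTorch rendering of the idea of a shared LSTM encoder trained jointly to reconstruct and translate, with per-token output heads standing in for full autoregressive decoders; all sizes, names, and the unweighted loss sum are assumptions, not the authors' design.

```python
# Simplified sketch: a shared LSTM encoder whose hidden states act as
# contextualised word embeddings, trained jointly with a reconstruction head
# and a translation head. Per-token projections replace real decoders here.
import torch
import torch.nn as nn

class DualDecoderLSTM(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.reconstruct_head = nn.Linear(hid_dim, src_vocab)  # rebuild the source
        self.translate_head = nn.Linear(hid_dim, tgt_vocab)    # predict the target

    def forward(self, src_ids):
        states, _ = self.encoder(self.src_emb(src_ids))  # (B, T, hid_dim)
        return states, self.reconstruct_head(states), self.translate_head(states)

model = DualDecoderLSTM(src_vocab=10_000, tgt_vocab=12_000)
loss_fn = nn.CrossEntropyLoss()
src = torch.randint(0, 10_000, (2, 7))  # toy source token ids
tgt = torch.randint(0, 12_000, (2, 7))  # toy target ids (same length for simplicity)
states, recon_logits, trans_logits = model(src)
loss = (loss_fn(recon_logits.reshape(-1, 10_000), src.reshape(-1))
        + loss_fn(trans_logits.reshape(-1, 12_000), tgt.reshape(-1)))
loss.backward()
# After training, `states` would be the contextualised word embeddings.
```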
- RUSSE'2020: Findings of the First Taxonomy Enrichment Task for the Russian language [70.27072729280528]
This paper describes the results of the first shared task on taxonomy enrichment for the Russian language.
Sixteen teams participated in the task, with strong results overall: more than half of them outperformed the provided baseline.
arXiv Detail & Related papers (2020-05-22T13:30:37Z)
- Cross-Lingual Word Embeddings for Turkic Languages [1.418033127602866]
Cross-lingual word embeddings can transfer knowledge from a resource-rich language to a low-resource one.
We show how to obtain cross-lingual word embeddings for the Turkish, Uzbek, Azeri, Kazakh, and Kyrgyz languages; a sketch of one standard alignment technique follows this entry.
arXiv Detail & Related papers (2020-05-17T18:57:23Z)
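The summary of the Turkic-languages entry above does not name a method. A standard way to obtain cross-lingual embeddings from monolingual ones is to learn an orthogonal (Procrustes) mapping between the two spaces from a small seed dictionary; the sketch below illustrates that technique on toy data and should not be read as the paper's approach.

```python
# Sketch of one common cross-lingual technique: align two monolingual embedding
# spaces with an orthogonal (Procrustes) mapping learned from a seed dictionary.
# Vectors and the dictionary here are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 100, 500

# X: source-language vectors, Y: target-language vectors for seed translation pairs.
X = rng.normal(size=(n_pairs, dim))
Y = rng.normal(size=(n_pairs, dim))

# Orthogonal Procrustes: W = argmin over orthogonal W of ||XW - Y||_F, via SVD.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def to_target_space(v):
    """Map a source-language vector into the target space."""
    return v @ W

v_src = rng.normal(size=dim)
v_mapped = to_target_space(v_src)
print(v_mapped.shape)  # (100,) -- now comparable (e.g. by cosine) with target vectors
```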
- A Survey on Contextual Embeddings [48.04732268018772]
Contextual embeddings assign each word a representation based on its context, capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
We review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
arXiv Detail & Related papers (2020-03-16T15:22:22Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings; a toy illustration of sense induction from static embeddings follows this entry.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
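The induction procedure itself is not described in the summary above. The toy sketch below only illustrates the general idea of building a sense inventory from a pre-trained embedding model by clustering a target word's nearest neighbours; the model file, target word, and clustering choices are placeholders rather than the paper's setup.

```python
# Toy sketch of sense induction from static word embeddings alone: take the
# nearest neighbours of a target word and group them by mutual similarity,
# treating each cluster as a candidate sense. Paths and parameters are
# placeholders, not the method or settings of the cited paper.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import AgglomerativeClustering

kv = KeyedVectors.load_word2vec_format("uzbek_fasttext.vec")  # hypothetical vectors

def induce_senses(word, topn=30, n_senses=2):
    neighbours = [w for w, _ in kv.most_similar(word, topn=topn)]
    vectors = np.stack([kv[w] for w in neighbours])
    labels = AgglomerativeClustering(n_clusters=n_senses).fit_predict(vectors)
    senses = {}
    for w, lab in zip(neighbours, labels):
        senses.setdefault(lab, []).append(w)
    return senses

# Each returned cluster is a coarse "sense" described by its member words;
# the target word is illustrative and depends on the model's vocabulary.
for sense_id, members in induce_senses("ot").items():
    print(sense_id, members[:10])
```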
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.