SimRelUz: Similarity and Relatedness scores as a Semantic Evaluation dataset for Uzbek language

Ulugbek Salaev∗, Elmurod Kuriyozov†, Carlos Gómez-Rodríguez†

∗Urgench State University, Department of Information Technologies
14, Kh. Alimdjan str., Urgench city, 220100, Uzbekistan
ulugbek0302@gmail.com

†Universidade da Coruña, CITIC, Grupo LYS, Depto. de Computación y Tecnologías de la Información, Facultade de Informática, Campus de Elviña, A Coruña 15071, Spain
{e.kuriyozov, carlos.gomez}@udc.es

arXiv:2205.06072v1 [cs.CL] 12 May 2022
Abstract

Semantic relatedness between words is one of the core concepts in natural language processing, thus making semantic evaluation an important task. In this paper, we present a semantic model evaluation dataset: SimRelUz - a collection of similarity and relatedness scores of word pairs for the low-resource Uzbek language. The dataset consists of more than a thousand pairs of words carefully selected based on their morphological features, occurrence frequency, and semantic relation, and annotated by eleven native Uzbek speakers from different age groups and genders. We also paid attention to the problem of dealing with rare words and out-of-vocabulary words to thoroughly evaluate the robustness of semantic models.

Keywords: natural language processing, Uzbek language, semantic evaluation, dataset, similarity, relatedness
1. Introduction
Having computational models that can measure the semantic relatedness and semantic similarity between concepts or words is an important fundamental task for many Natural Language Processing (NLP) applications, such as word sense disambiguation (Navigli, 2009; Agirre and Edmonds, 2007), thesauri and automatic dictionary generation (Mihalcea and Moldovan, 2001; Solovyev et al., 2020), as well as machine translation (Bahdanau et al., 2014; Brown et al., 1990).
Many language models have been created that yield good-quality semantic knowledge, yet their evaluation depends on gold-standard datasets that have word/concept pairs scored by their semantic relations (such as synonymy, antonymy, meronymy, hypernymy, etc.), which come with a cost due to their time-consuming creation process and high dependence on human annotators.
Many such datasets have been created so far for resource-rich languages (Hill et al., 2015; Finkelstein et al., 2001; Rubenstein and Goodenough, 1965).
However, there is still a big gap in the availability of such datasets for low-resource languages.
The current work aims to fill that gap by providing, to our knowledge, the first semantic similarity and relatedness dataset for the Uzbek language.
In this paper, we describe all the steps we followed as a set of data collection and annotation guidelines, with the full statistics and results obtained.
The main contributions of this paper are two-fold:
• Publicly available word pair semantic similarity and relatedness scoring web-based questionnaire software1;
• Publicly available semantic evaluation dataset including both similarity and relatedness scores for the low-resource Uzbek language2.
Furthermore, this paper also describes some important considerations in constructing the dataset with respect to morphological and semantic attributes of a morphologically rich language, together with their visualisations.
Uzbek language (native: O‘zbek tili) is a member of the Eastern Turkic, or Karluk, branch of the Turkic language family, an official language of Uzbekistan, and also a second language in neighbouring Central-Asian countries.
It has more than 30 million speakers inside Uzbekistan alone, and more than ten million elsewhere in Central Asian countries, the Southern Russian Federation, as well as the North-Eastern part of China, making it the second most widely spoken language among Turkic languages (right after Turkish)3.
This paper has been organised as follows: it starts with a terminology section explaining the basic definitions of terms used in the paper, then comes a related work section, followed by a description of the dataset creation and annotation process, moving on to some insights into the dataset; in the end, the authors describe their discussion, conclusions, as well as future work.
1Demo website: https://simrel.urdu.uz
2Both the publicly available dataset and the source code of the web-application can be found here: https://github.com/UlugbekSalaev/SimRelUz
3More information about the Uzbek language: https://en.wikipedia.org/wiki/Uzbek_language
2. Terminology
In order to eliminate repetition and to avoid confusion in understanding the terms used in this paper, the terms similarity, relatedness, association, and distance may come with or without the prefix “semantic” interchangeably, but they are meant to mean the same thing, respectively.
The term semantic similarity, in general, stands for a sense of relatedness that depends on the amount of shared properties, thus the ’degree of synonymy’. The term semantic relatedness, in contrast, means a general sense of semantic proximity or semantic association, regardless of the causes of the connection humans can perceive.
For instance, bus/train is a good example of semantic similarity, where the two words share many properties: they are both means of transport, both consume similar sorts of energy, have engines to operate, etc. On the other hand, teapot/cup is a good example of semantic relatedness, where the two do not necessarily share common properties but are used in a similar context: they both hold tea, but a teapot is for steeping tea in larger amounts, while a cup is for serving and drinking tea in smaller portions.
Both of the above-mentioned examples can be used for semantic relatedness, though, which means that semantic similarity is included inside semantic relatedness.
Therefore, semantically similar things are, at the same time, semantically related, but the converse cannot be said to be the case in general.
3. Related Work
The first creation of a stand-alone semantic relation evaluation dataset dates back to the RG dataset (Rubenstein and Goodenough, 1965), which was created for semantic similarity more than relatedness4.
Although it was very small in size (limited to only 65 noun pairs), it clearly showed the scientific importance of the task, so research interest continued later with more datasets coming along.
The FrameNet (Baker et al., 1998) dataset is a rich linguistic resource with morphological as well as expert-annotated semantic information.
Among the most important gold-standard semantic evaluation datasets, we can find the WordSim-353 (Finkelstein et al., 2001), MEN (Bruni et al., 2012), and SimLex-999 (Hill et al., 2015) datasets for English.
WordSim-3535 contains 353 noun pairs scored by multiple human annotators.
Similar to WordSim-353, the MEN6 dataset is also described as having similarity and relatedness distinctly, but the annotators were only asked to rate based on semantic relatedness. The creation of the SimLex-9997 dataset made it the state-of-the-art gold standard semantic relatedness evaluation source.
Some popular datasets for other languages include the RG dataset’s German translation (Gurevych, 2005), the database of paradigmatic semantic relation pairs for German (Scheible and Im Walde, 2014), and SimLex-999’s translation into three languages: Italian, German and Russian (Leviant and Reichart, 2015).
The Multi-SimLex (Vulić et al., 2020) project includes datasets for 12 diverse languages, including both major languages (English, Russian, Chinese, etc.) and less-resourced ones (Welsh, Kiswahili).
Multi-SimLex8 was a project that originated from SimLex-999 and took it another step further by creating a larger and more comprehensive dataset.
Linguistic databases such as VerbNet (Schuler, 2005) and WordNet (Miller, 1995; Fellbaum, 2010), together with their implementations for other languages, also contain semantically rich information created by experts.
Since this is the first work of this kind for the Uzbek language, the closest related work would be the resources created for other Turkic languages, such as Turkish WordNets (Tufis et al., 2004; Bakay et al., 2021), and especially the AnlamVer dataset (Ercan and Yıldız, 2018), which contains both semantic similarity and relatedness scores annotated by many native speakers.
Furthermore, AnlamVer also shares useful knowledge of dataset design considerations when dealing with morphologically rich and agglutinative languages.
Although there have been many papers published claiming that they have created NLP resources or developed some useful tools for the Uzbek language, most of them, according to humble search results gathered by the authors, turned out to be “zigglebottom” papers (Pedersen, 2008).
However, there are also many useful papers with publicly available resources, among them the first Uzbek morphological analyzer (Matlatipov and Vetulani, 2009), transliteration (Mansurov and Mansurov, 2021a), WordNet-type synsets (Agostini et al., 2021), an Uzbek stopwords dataset (Madatov et al., 2021), sentiment analysis (Rabbimov et al., 2020; Kuriyozov and Matlatipov, 2019), text classification (Rabbimov and Kobilov, 2020), and even a recent pretrained Uzbek language model based on the BERT architecture (Mansurov and Mansurov, 2021b).
There is also a well-established Finite State Transducer (FST) based morphological analyzer for the Uzbek language, with more than 60K lexemes, in the Apertium monolingual package9.
So we followed the design choices and recommendations brought by the authors of previous work (Finkelstein et al., 2001; Bruni et al., 2012; Hill et al., 2015; Ercan and Yıldız, 2018; Vulić et al., 2020), such as the following:
• Clear definition: The dataset must provide a clear definition of what semantic relation is supposed to be scored. So we decided to collect scores of both similarity and relatedness separately;
• Language representativity: The dataset should be built considering diverse concepts of the language, such as parts of speech (i.e. verb, noun, adjective, ...), word formations (root, inflectional, or derivative), possible semantic relations (i.e. synonymy, antonymy, meronymy, ...), as well as the frequency range (i.e. frequent words, rare words, even out-of-vocabulary words);
• Consistency and reliability: Clear and precise scoring guidelines were provided to get consistent annotations from native speakers with different levels of linguistic expertise.
More detailed information regarding each criterion is given below.
4.1. Design choice
For the design of the dataset we followed the AnlamVer project (Ercan and Yıldız, 2018): instead of building two separate datasets for semantic similarity and relatedness, we decided to rate each word pair with two separate scores, one for similarity and another for relatedness.
This way, the resulting dataset was smaller in size, but richer in information.
Moreover, this approach gave us an opportunity to visualize the dataset as a semantic relation space, using the two scores as two dimensions and creating a scatter plot.
According to the methodology proposed by the AnlamVer (Ercan and Yıldız, 2018) project, it is possible to predict the semantic relation of word pairs by their location in the “Sim-Rel vector space”, which is given in Figure 1.
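As an illustration of how a pair's position in this Sim-Rel space can be read off, the sketch below maps a pair of averaged scores to a coarse region of Figure 1. This is not the authors' code: the axis orientation and the mid-scale threshold of 5.0 are assumptions made here purely for illustration.

```python
# Illustrative only: coarse mapping from averaged (similarity, relatedness)
# scores on the 0-10 scale to the quadrants of the Sim-Rel vector space.
def sim_rel_region(similarity: float, relatedness: float, threshold: float = 5.0) -> str:
    """Name the Sim-Rel quadrant a word pair falls into (threshold is assumed)."""
    similar = similarity >= threshold
    related = relatedness >= threshold
    if similar and related:
        return "Similar-Related (synonym-like pairs)"
    if related:
        return "Dissimilar-Related (e.g. antonyms, meronyms, hypernyms)"
    if similar:
        return "Similar-Unrelated (expected to stay empty)"
    return "Dissimilar-Unrelated (random/irrelevant pairs)"

print(sim_rel_region(8.7, 9.1))  # synonym-like pair
print(sim_rel_region(2.1, 7.4))  # related but not similar, e.g. teapot/cup
print(sim_rel_region(0.8, 1.2))  # random pair
```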
4.2. Word candidates selection
Probably a relatively easy way to obtain candidate words with minimum work would be translating words from gold-standard resources available for resource-rich languages (i.e. Multi-SimLex (Vulić et al., 2020)).
However, various relevant problems have been reported to be caused by the use of such translations, such as:
• Two synonym pairs from a source language being mapped to one word in the target language (both words in the car - automobile pair in English would be mapped to a single avtomobil in Uzbek);
• A translation of a single word in a source language that makes it multiple words in a target one (the word asylum in English would be translated as ruhiy kasalliklar shifoxonasi in Uzbek);
• Loss in the similarity/relatedness scores due to other cross-lingual aspects of pairs, such as translation accuracy or semantic/grammatical/cultural differences, which requires human annotators to re-score, leaving the costly part to be done again.
Therefore, we decided to choose the candidate wordlist ourselves for better quality.
The first thing to make was a comprehensive list of words in the language using a big language corpus.
For the language corpus mentioned in this work, we used the Uzbek corpus from the CUNI corpora for Turkic languages (Baisa et al., 2012), which is, to our knowledge, the biggest Uzbek corpus collected, with 18M tokens.
To obtain their part-of-speech (POS) tags, we used the UzWordNet dataset (Agostini et al., 2021) (which contains very limited information on root words with their POS classes), and the Apertium-Uzb monolingual data10 (which contains more than 60K Uzbek root words with their POS tags).
Then we extracted nouns, adjectives and verbs only (in relatively descending order, according to their frequencies in the corpus), following the custom of similar gold-standard semantic evaluation resources.
Apart from root forms only, we also manually selected words with inflectional and derivational forms.
4.3. Frequency-based considerations
Considering the agglutinative nature of the Uzbek language, creating the list of word frequencies in this language is not an easy task, since a single word can occur together with many different morphemes (either a single morpheme or a combination of many), making it difficult to obtain the actual count of occurrences of a single root-word.
In this paper, we created a list of stems with their frequencies in the Uzbek language using the biggest available Uzbek corpus (Baisa et al., 2012).
Firstly, the CUNI corpus was tokenized into sentences, then all the sentences were fed to the Apertium morphological analyser tool for the Uzbek language11.
Then, all the parts except for the lemmas of the resulting output were removed, which allowed us to obtain a stem/root-word frequency list.
Our priority was to include as many words with different frequencies as possible, so we used a technique similar to the one issued by the RareWords dataset (Luong et al., 2013): grouping words by their frequencies, dividing them into three groups labeled as low, medium, and high, with [2,5], [6,49], and [50+] count ranges respectively.
10https://github.com/apertium/apertium-uzb
11Although we have used the CLI analyzer, the web-accessed version of the Apertium morphological analyzer can also be used to check its features: https://turkic.apertium.org/index.eng.html?choice=uzb#analyzation

Figure 1: Semantic relation vector-space (proposed by the AnlamVer project).
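Below is a minimal sketch of the stem counting and frequency grouping described in Section 4.3; it is not the authors' implementation. The lemma list is hypothetical, assumed to come from the morphological analyser, and only the [2,5], [6,49] and [50+] ranges are taken from the text.

```python
from collections import Counter

# Hypothetical lemmatised output (one lemma per token), standing in for the
# stem list extracted from the CUNI corpus with the Apertium analyser.
lemmas = ["kitob", "kitob", "maktab", "avtomobil", "kitob", "maktab"]

freq = Counter(lemmas)

def frequency_group(count: int) -> str:
    """Assign a stem to a frequency group using the count ranges from the paper."""
    if count >= 50:
        return "high"      # [50+]
    if count >= 6:
        return "medium"    # [6, 49]
    if count >= 2:
        return "low"       # [2, 5]
    return "out-of-range"  # single occurrences; rare/OOV words are handled separately (Sec. 4.4)

groups = {stem: frequency_group(count) for stem, count in freq.items()}
print(groups)  # {'kitob': 'low', 'maktab': 'low', 'avtomobil': 'out-of-range'}
```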
4.4. Rare and OOV words
Furthermore, to make the dataset useful for checking the robustness of semantic models, it is also important to consider less-frequent words, and even words that do not exist in the language dictionary but might appear in context due to morphological (surface words), syntactical (typos), or phonetical (homophones) reasons.
Thus, words whose root form does not appear more than 3 times in the corpus were grouped as rare words, and representatives of them were manually selected for the word list.
Considering the rich morphology of the Uzbek language, like other Turkic languages, there is a high inflection and derivation rate, where words are made in an agglutinative way: by combining a stem and one or more morphemes (as prefixes or suffixes).
Hence, there is a high chance that a word may be grammatically wrong, but was created following surface-word creation rules (of which an almost unlimited number can be created).
So we chose the following two most common out-of-vocabulary word cases, which are formally incorrect but considered acceptable forms by native speakers, and added some examples to the dataset:
• Phonetic ambiguation: Two letters in the Uzbek alphabet, “x” and “h”, are phonetically so close to each other that it is hard to distinguish them when used in context, so people frequently mistake one for the other when writing. E.g. pahta instead of paxta (cotton), shaxzoda instead of shahzoda (prince).
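As a toy illustration of how such x/h spelling variants can be enumerated (this is a hypothetical helper, not the authors' procedure for building the OOV examples):

```python
DIGRAPH_LEADS = {"s", "c"}  # 'sh' and 'ch' are single letters in the Uzbek Latin script

def xh_variants(word: str) -> set:
    """Enumerate spelling variants obtained by confusing 'x' and 'h' (toy example)."""
    variants = {word}
    for i, ch in enumerate(word):
        if ch == "h" and i > 0 and word[i - 1] in DIGRAPH_LEADS:
            continue  # keep the 'h' of the 'sh'/'ch' digraphs untouched
        swap = {"x": "h", "h": "x"}.get(ch)
        if swap:
            variants |= {v[:i] + swap + v[i + 1:] for v in list(variants)}
    return variants - {word}

print(xh_variants("paxta"))     # {'pahta'}
print(xh_variants("shahzoda"))  # {'shaxzoda'}
```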
In total, 128 examples from both rare and OOV words with diverse POS types and word forms were added to the dataset.
After going through all the above-mentioned steps and considerations, we gathered 1963 unique words to construct pairs.
Their distribution among word types, word forms, as well as word frequencies is given in Table 1.
4.5. Word pairs selection
Choosing word pairs randomly and scoring them would require the dataset to be huge in size, taking a very long time to annotate, so we tried to provide the best quality semantic evaluation dataset with a limited number of word pairs by pre-establishing common semantic relations, such as synonymy, antonymy, hypernymy, and meronymy.
Furthermore, we added word pairs by random allocation, and we named this category of pairs “irrelevant” (not in the sense of irrelevant pairs, but in the sense of the magnitude of their semantic similarity and relatedness, as they are more likely to have very low scores on both sides).

Category            # of word pairs
Synonyms            639
Antonyms            239
Hypernyms           220
Meronyms            193
Irrelevant/Random   127
Total               1418

Table 2: Distribution of word pairs by their pre-established semantic relations.

5. Annotation process
For the annotation process, we have created a web-based survey application where each annotator is given a unique username and password, with which they can access the website and rate given word pairs with two separate scores at once.
The general user interface of the annotation page can be seen in Figure 2.
In total, eleven annotators (including two authors), who are native Uzbek speakers with different linguistic backgrounds, from different age groups and genders, participated in the annotation, rating each pair once with two scores (one for similarity, and the other for relatedness) from 0 to 10.
Based on a statistical analysis from (Snow et al., 2008), more than ten annotators for a semantic evaluation are reliable enough.
In the end, there were eleven scores of similarity and the same amount for relatedness for each word pair, and we took their averages as the final scores.
Figure 3 shows the distribution of age and gender among annotators.

Figure 3: Distribution of annotators based on gender and age-groups.
6. Results
The resulting dataset is composed of 1418 word pairs from different word types (nouns, adjectives and verbs), different word forms (root, inflectional, derivational), with different frequencies (high, mid, low frequencies, rare and OOV words), and with diverse pre-established semantic relations (synonym, antonym, meronym, hypernym, not related). All the pairs have two scores, one for semantic similarity, while the other is for semantic relatedness.

Figure 4: Visualisation of the created dataset in a Sim-Rel vector space.

No field in the dataset was left empty (as was requested from annotators in the guidelines, even for the OOV cases), and the average pairwise inter-annotator agreement scores (apia) were computed for semantic similarity and relatedness separately. We achieved 0.71 and 0.69 apia scores for semantic similarity and relatedness respectively, meaning that although we scored lower than the AnlamVer dataset (0.75), our dataset still performed better than most semantic evaluation datasets (SimLex=0.67, MEN=0.68).
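A minimal sketch of this score aggregation and agreement computation is given below; the score matrix is hypothetical, and Spearman correlation is assumed here as the pairwise agreement measure, since the text does not state which measure was used.

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

# Hypothetical score matrix: rows = annotators, columns = word pairs (0-10 scale).
scores = np.array([
    [9, 2, 7, 0, 5],
    [8, 3, 6, 1, 5],
    [9, 1, 8, 0, 4],
])

def average_pairwise_agreement(score_matrix: np.ndarray) -> float:
    """Mean pairwise Spearman correlation over all annotator pairs (assumed measure)."""
    corrs = [spearmanr(a, b)[0] for a, b in combinations(score_matrix, 2)]
    return float(np.mean(corrs))

final_scores = scores.mean(axis=0)         # averaged human judgements per word pair
print(average_pairwise_agreement(scores))  # apia for this toy matrix
print(final_scores)
```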
The resulting dataset can be plotted into the Sim-Rel vector space as shown in Figure 4.
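A minimal plotting sketch along these lines is shown below, with hypothetical per-relation score lists; the axis assignment (relatedness on the x-axis, similarity on the y-axis) is assumed, and the styling of Figure 4 is not reproduced.

```python
import matplotlib.pyplot as plt

# Hypothetical averaged (relatedness, similarity) scores per pre-established relation.
pairs_by_relation = {
    "synonym":    [(9.1, 8.8), (8.4, 8.9)],
    "antonym":    [(7.6, 2.1), (8.0, 1.5)],
    "irrelevant": [(0.9, 0.7), (1.4, 1.1)],
}

fig, ax = plt.subplots()
for relation, points in pairs_by_relation.items():
    rel, sim = zip(*points)
    ax.scatter(rel, sim, label=relation)

ax.set_xlabel("relatedness (0-10)")
ax.set_ylabel("similarity (0-10)")
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.legend()
plt.show()
```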
Discussions.
As can be seen from the scatter plot of the dataset in a vector space (Figure 4), it can be concluded that the average scores of word pairs visually correlate with our pre-established relation types, since they are scattered mostly inside and around the determined areas of the vector space.
Irrelevant and random pairs can be easily detected from the plot, as they have little overlap with the other types.
It is also worth mentioning that none of the word pairs falls in the Similar-Unrelated part of the plot (the top-left quarter of the vector space), confirming its reliability, since a pair of words cannot be similar yet unrelated at the same time.
7. Conclusion
In this paper, we presented SimRelUz, a novel semantic evaluation dataset for the low-resource Uzbek language, with semantic similarity and relatedness scores for 1418 word pairs, which were selected based on their morphological classes, word-forms, and frequencies, also including rare and out-of-vocabulary words for better evaluation of semantic language models.
This kind of dataset is a useful resource for evaluating the computational semantic analysis systems that will be created in the future for Uzbek; in simpler words, for formal analysis of meaning in language models.
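As an illustration of this intended use, the sketch below correlates a model's cosine similarities with the human similarity scores over such word pairs; the embedding lookup and the sample pairs are hypothetical, and Spearman correlation is a common choice for this kind of evaluation rather than one prescribed by the authors.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical resources: a tiny embedding lookup and a few annotated pairs.
embeddings = {
    "avtomobil": np.array([0.8, 0.1, 0.3]),
    "mashina":   np.array([0.7, 0.2, 0.3]),
    "choynak":   np.array([0.1, 0.9, 0.2]),
    "piyola":    np.array([0.2, 0.8, 0.4]),
}
dataset = [  # (word1, word2, human similarity score on the 0-10 scale)
    ("avtomobil", "mashina", 9.3),
    ("choynak", "piyola", 3.1),
    ("avtomobil", "piyola", 0.8),
]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores, human_scores = [], []
for w1, w2, gold in dataset:
    if w1 in embeddings and w2 in embeddings:  # OOV pairs would need a fallback strategy
        model_scores.append(cosine(embeddings[w1], embeddings[w2]))
        human_scores.append(gold)

print(spearmanr(model_scores, human_scores)[0])  # rank correlation with human judgements
```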
8. Acknowledgements
This work has received funding from ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), from Xunta de Galicia (ED431C 2020/11), and from Centro de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program), by grant ED431G 2019/01.
Elmurod Kuriyozov was funded for his PhD by the El-Yurt-Umidi Foundation under the Cabinet of Ministers of the Republic of Uzbekistan.
The authors would also like to thank the NLP team of Urgench State University for their tremendous help with the web hosting and annotation.
9. Bibliographical References
Agirre, E. and Edmonds, P. (2007). Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media.
Agostini, A., Usmanov, T., Khamdamov, U., Abdurakhmonova, N., and Mamasaidov, M. (2021). UzWordNet: A lexical-semantic database for the Uzbek language. In Proceedings of the 11th Global Wordnet Conference, pages 8–19.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Baisa, V., Suchomel, V., et al. (2012). Large corpora for Turkic languages and unsupervised morphological analysis. In Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).
Bakay, Ö., Ergelen, Ö., Sarmış, E., Yıldırım, S., Özçelik, M., Arıcan, B. N., Kocabalcıoğlu, A., Sanıyar, E., Kuyrukçu, O., Avar, B., et al. (2021). Turkish WordNet KeNet. In Proceedings of the 11th Global Wordnet Conference, pages 166–174.
Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet project. In COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Bruni, E., Boleda, G., Baroni, M., and Tran, N.-K. (2012). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 136–145.
Ercan, G. and Yıldız, O. T. (2018). AnlamVer: Semantic model evaluation dataset for Turkish - word similarity and relatedness. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3819–3836.
Fellbaum, C. (2010). WordNet. In Theory and applications of ontology: computer applications, pages 231–243. Springer.
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414.
Gurevych, I. (2005). Using the structure of a conceptual network in computing semantic relatedness. In International Conference on Natural Language Processing, pages 767–778. Springer.
Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
Kuriyozov, E. and Matlatipov, S. (2019). Building a new sentiment analysis dataset for Uzbek language and creating baseline models. In Multidisciplinary Digital Publishing Institute Proceedings, volume 21, page 37.
Leviant, I. and Reichart, R. (2015). Judgment language matters: Multilingual vector space models for judgment language aware lexical semantics.