Lex2vec: making Explainable Word Embedding via Distant Supervision

Fabio Celli
Maggioli Research & Development
Via Bornaccino 101, Santarcangelo di Romagna, Italy
fabio.celli@maggioli.it
Abstract

In this technical report, we propose an algorithm, called Lex2vec, that exploits lexical resources to inject information into word embeddings and name the embedding dimensions by means of distant supervision. We evaluate the optimal parameters to extract a number of informative labels that is readable and has good coverage of the embedding dimensions.
1 Introduction and Related Work

From 2000 to 2020, Natural Language Processing adopted several approaches to the study of semantics, from lexical semantics to distributional semantics and word embeddings, either context-free or transformer-based.
Lexical semantics had the advantage of being fully interpretable, and even the most abstract concepts, like semantic relations [Celli, 2009b] or qualia structures [Pustejovsky and Jezek, 2008], were encoded by labels [Celli, 2010] and classified [Celli, 2009a].
In distant supervision, we make use of an existing database, such as Freebase or a domain-specific database, to collect and label examples of the relation we want to extract. This approach has worked very well in semantic relation extraction tasks [Smirnova and Cudré-Mauroux, 2018].
Distributional semantics solved the word ambiguity problem by computing co-occurrence word vectors, which made it possible to measure the distance between similar words in multidimensional conceptual spaces [Mohammad and Hirst, 2012], opening new possibilities for the extraction of semantic relations [Celli and Nissim, 2009].
However, although the resulting matrices are interpretable, they are also huge and very sparse, which is a limitation for supervised learning.
Context-free word embeddings [Mikolov et al., 2013] solve the sparsity problem by using neural network representations to embed many word-context dimensions into features.
In doing so, they reduce the feature space and boost predictive power in semantic relation extraction tasks [Gábor et al., 2018], but they reopen the word disambiguation problem and make the meaning of each dimension completely opaque.
Transformer-based embeddings like BERT [Devlin et al., 2018] also perform word sense disambiguation, because they create a different vector for each word meaning, but they still remain opaque.
Crucially, there are efforts towards explainable word embeddings, like EVE, a vector embedding technique which is built upon the structure of Wikipedia and exploits the Wikipedia hierarchy to represent a concept using human-readable labels [Qureshi and Greene, 2019].
Other techniques, like Layerwise Relevance Propagation, try to determine which features of a particular input vector contribute most to a word embedding's output [Şenel et al., 2018], providing some clues for model interpretation but without naming the features.
Existing word embedding learning algorithms typically only use the contexts of words but ignore the sentiment of texts.
There are proposals for adding sentiment-specific information into word embeddings [Tang et al., 2015], and this kind of technique could also be exploited to make the embeddings more transparent.
In this technical report, we propose an algorithm, called Lex2vec, that exploits lexical resources, like LIWC [Tausczik and Pennebaker, 2010] or NRC [Mohammad et al., 2013], to inject information into word embeddings and name the embedding dimensions by means of distant supervision.
In the next section, we present the algorithm, the data and the lexical resources used for the evaluation.
2 Algorithm, Data and Evaluation
The algorithm, depicted as a flowchart in Figure 1, requires a lexical resource @l and takes as input a word-embedding dictionary @d, produced with Word2Vec or GloVe.
All the values in the word-embedding vectors must be normalized between 0 and 1.
Then the algorithm extracts the header with the unnamed embedding dimensions @h and counts how many dimensions there are ($n).
Then, for each line $i in the dictionary @d, the algorithm splits the embedding vector @e, takes the word $w (escaping meta-characters if needed) and checks the lexical resource for the corresponding word label(s) @lw.
If a word label is found, a for loop evaluates whether each value of the embedding vector is greater than a threshold theta or lower than 1 minus theta, and if it is, the algorithm maps the label to the corresponding dimension in the header @h[$j], concatenating multiple labels.
The threshold theta is a parameter that allows us to select the most informative words, namely the ones whose embedding score falls in the highest or lowest percentile.
There are many techniques that could filter labels, e.g. a simple limit on the concatenation or a threshold on the ranking of the most frequent labels per dimension, but our goal here is to experiment with the theta parameter without filtering techniques, to optimize the number of labels (too many labels decrease readability) and reduce the ratio of unnamed dimensions.
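As a reference for the steps described above, the following is a minimal Python sketch of the procedure. The data layout (plain dictionaries for the embeddings and the lexicon) and all function names are our own illustrative assumptions; the original implementation uses the @d, @l, @h, $w and theta variables mentioned in the text.

```python
# A minimal sketch of the Lex2vec naming procedure, assuming:
# - `embeddings`: dict mapping each word to a list of values normalized to [0, 1]
# - `lexicon`: dict mapping words to one or more labels from a lexical resource
#   such as LIWC or NRC (distant supervision source)

def lex2vec(embeddings, lexicon, theta=0.75):
    # Number of embedding dimensions, taken from any vector in the dictionary.
    n = len(next(iter(embeddings.values())))
    # Header of unnamed dimensions: one (initially empty) list of labels per dimension.
    header = [[] for _ in range(n)]

    for word, vector in embeddings.items():
        labels = lexicon.get(word)      # look the word up in the lexical resource
        if not labels:
            continue                    # word not covered by the lexicon
        for j, value in enumerate(vector):
            # Keep only the most informative words: values above theta or below 1 - theta.
            if value > theta or value < 1 - theta:
                for label in labels:
                    if label not in header[j]:
                        header[j].append(label)   # concatenate multiple labels per dimension

    # Join the collected labels into a readable name for each dimension.
    return ["+".join(labels) if labels else "unnamed" for labels in header]


if __name__ == "__main__":
    # Toy example with 3-dimensional, [0, 1]-normalized vectors.
    embeddings = {"happy": [0.91, 0.12, 0.50], "war": [0.05, 0.88, 0.49]}
    lexicon = {"happy": ["posemo"], "war": ["negemo", "anger"]}
    print(lex2vec(embeddings, lexicon, theta=0.75))
    # -> ['posemo+negemo+anger', 'posemo+negemo+anger', 'unnamed']
```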
Figure 1: Flowchart of the algorithm Lex2vec.

Table 1: Results of the evaluation.
For the evaluation, we used ACE2004, a corpus for information extraction from news [Doddington et al., 2004]: we extracted word embeddings with Word2vec and applied the Lex2vec algorithm with (a small version of) LIWC (about 500 words) and NRC (about 6400 words) to map words to linguistic labels.
Results, reported in Table 1, show that, as the theta parameter increases, the number of labels per dimension decreases, making the names of the embedding dimensions more readable, but the percentage of dimensions that remain unnamed increases as well.
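For illustration, here is a small, self-contained sketch of how the two quantities behind Table 1 could be computed from the dimension names produced by the sketch above; the function and metric names are our own assumptions, not the paper's code.

```python
# Hedged sketch: compute readability (average labels per named dimension) and
# coverage (percentage of dimensions left unnamed) from a list of dimension
# names in the "label+label" / "unnamed" format used by the lex2vec() sketch.

def evaluate_naming(dimension_names):
    named = [name for name in dimension_names if name != "unnamed"]
    # Average number of concatenated labels per named dimension (readability).
    labels_per_dim = sum(name.count("+") + 1 for name in named) / len(named) if named else 0.0
    # Percentage of dimensions that remain unnamed (coverage).
    unnamed_pct = 100.0 * (len(dimension_names) - len(named)) / len(dimension_names)
    return labels_per_dim, unnamed_pct


print(evaluate_naming(["posemo+anger", "negemo", "unnamed", "unnamed"]))
# -> (1.5, 50.0)
```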
Our conclusion is that the algorithm is suitable for the explainability of word embeddings and that the theta parameter should be around 0.75; this suggests that some strategy to limit the concatenation of labels, or to select the best ones, is necessary, especially with larger lexical resources.
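As a sketch of one such strategy, the following variant (our own assumption, not part of the original algorithm) collects labels with repetition and keeps only the k most frequent labels per dimension.

```python
from collections import Counter

def name_with_top_k(embeddings, lexicon, theta=0.75, k=3):
    """Variant of the Lex2vec sketch above: collect labels with repetition and
    keep only the k most frequent labels per dimension. This is one possible
    filtering strategy among those mentioned in the text; the value of k is an
    arbitrary illustrative choice, not a value from the paper."""
    n = len(next(iter(embeddings.values())))
    counts = [Counter() for _ in range(n)]
    for word, vector in embeddings.items():
        for label in lexicon.get(word, []):
            for j, value in enumerate(vector):
                if value > theta or value < 1 - theta:
                    counts[j][label] += 1
    return ["+".join(label for label, _ in c.most_common(k)) if c else "unnamed"
            for c in counts]
```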
References

[Pustejovsky and Jezek, 2008] James Pustejovsky and Elisabetta Jezek. Semantic coercion in language: Beyond distributional analysis. Italian Journal of Linguistics, 20(1):175–208, 2008.

[Qureshi and Greene, 2019] M Atif Qureshi and Derek Greene. Eve: explainable vector based embedding technique using wikipedia. Journal of Intelligent Information Systems, 53(1):137–165, 2019.

[Şenel et al., 2018] Lütfi Kerem Şenel, Ihsan Utlu, Veysel Yücesoy, Aykut Koc, and Tolga Cukur. Semantic structure and interpretability of word embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1769–1779, 2018.

[Smirnova and Cudré-Mauroux, 2018] Alisa Smirnova and Philippe Cudré-Mauroux. Relation extraction using distant supervision: A survey. ACM Computing Surveys (CSUR), 51(5):1–35, 2018.

[Tang et al., 2015] Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu, and Ming Zhou. Sentiment embeddings with applications to sentiment analysis. IEEE Transactions on Knowledge and Data Engineering, 28(2):496–509, 2015.

[Tausczik and Pennebaker, 2010] Yla R Tausczik and James W Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54, 2010.