Deconstructing word embedding algorithms
- URL: http://arxiv.org/abs/2011.07013v1
- Date: Thu, 12 Nov 2020 14:23:35 GMT
- Title: Deconstructing word embedding algorithms
- Authors: Kian Kenyon-Dean, Edward Newell, Jackie Chi Kit Cheung
- Abstract summary: We propose a retrospective on some of the most well-known word embedding algorithms.
In this work, we deconstruct Word2vec, GloVe, and others into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings.
- Score: 17.797952730495453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings are reliable feature representations of words used to obtain
high quality results for various NLP applications. Uncontextualized word
embeddings are used in many NLP tasks today, especially in resource-limited
settings where high memory capacity and GPUs are not available. Given the
historical success of word embeddings in NLP, we propose a retrospective on
some of the most well-known word embedding algorithms. In this work, we
deconstruct Word2vec, GloVe, and others into a common form, unveiling some of
the common conditions that seem to be required for making performant word
embeddings. We believe that the theoretical findings in this paper can provide
a basis for more informed development of future models.
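As a rough illustration of what such a common form can look like (this is the widely cited matrix-factorization reading of these models, offered as context rather than as the paper's exact derivation), skip-gram with negative sampling (SGNS) and GloVe can both be read as fitting dot products of word and context vectors to corpus co-occurrence statistics:

```latex
% SGNS: per word-context pair (w, c), with k negative contexts drawn from a
% noise distribution P_n
\ell_{\mathrm{SGNS}}(w, c) =
    \log \sigma(\vec{w} \cdot \vec{c})
    + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[ \log \sigma(-\vec{w} \cdot \vec{c}_i) \right]

% At its optimum (Levy & Goldberg, 2014) the dot product recovers a shifted PMI:
\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c) - \log k

% GloVe: weighted least squares against log co-occurrence counts X_{ij}
J_{\mathrm{GloVe}} = \sum_{i, j} f(X_{ij})
    \left( \vec{w}_i \cdot \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

In both objectives the geometry of the learned vectors is tied to a pointwise word-context association statistic, which is one concrete sense in which superficially different algorithms can share a form.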
Related papers
- Word Embeddings for Banking Industry [0.0]
Bank-specific word embeddings could be a good stand-alone source or a complement to other widely available embeddings.
This paper explores the idea of creating bank-specific word embeddings and evaluates them against other sources of word embeddings, such as GloVe and BERT.
arXiv Detail & Related papers (2023-06-02T01:00:44Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid word-and-character approaches, as well as subword-based approaches that rely on learned segmentation, have been proposed and evaluated.
We conclude that there is no, and likely never will be, a single silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Fast Extraction of Word Embedding from Q-contexts [17.370344754614518]
We show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus, together with their mutual information with words, one can construct high-quality word embeddings with negligible error.
We present an efficient and effective WEQ method, which is capable of extracting word embedding directly from these typical contexts.
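A minimal sketch of the flavour of this construction, under the assumption that "typical" Q-contexts are approximated here by a random column subset and that PPMI plus a truncated SVD stands in for the paper's WEQ procedure:

```python
import numpy as np

def embeddings_from_few_contexts(cooc, n_contexts=512, dim=100, seed=0):
    """Build word vectors from PPMI scores against a small context subset.

    cooc: (V_words, V_contexts) raw co-occurrence count matrix.
    The subset selection here is plain random sampling; the paper's notion
    of "typical" Q-contexts would replace this step.
    """
    rng = np.random.default_rng(seed)
    cols = rng.choice(cooc.shape[1], size=min(n_contexts, cooc.shape[1]),
                      replace=False)
    sub = cooc[:, cols].astype(float)

    # Positive PMI between each word and each sampled context.
    total = sub.sum()
    p_w = sub.sum(axis=1, keepdims=True) / total
    p_c = sub.sum(axis=0, keepdims=True) / total
    p_wc = sub / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.nan_to_num(np.maximum(pmi, 0.0))

    # Low-rank factorization of the thin word-by-context matrix.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])
```

The matrix being factorized is only V × Q with Q much smaller than the vocabulary, which is what makes this style of extraction cheap.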
arXiv Detail & Related papers (2021-09-15T05:14:31Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- Obtaining Better Static Word Embeddings Using Contextual Embedding Models [53.86080627007695]
Our proposed distillation method is a simple extension of CBOW-based training.
As a side-effect, our approach also allows a fair comparison of both contextual and static embeddings.
arXiv Detail & Related papers (2021-06-08T12:59:32Z)
- Meta-Embeddings for Natural Language Inference and Semantic Similarity tasks [0.0]
Word Representations form the core component for almost all advanced Natural Language Processing (NLP) applications.
In this paper, we propose to use Meta Embeddings derived from a few State-of-the-Art (SOTA) models to efficiently tackle mainstream NLP tasks.
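The summary does not say how the meta-embedding is assembled, so the sketch below shows two of the simplest standard combination schemes (concatenation, and averaging after a dimension-matching projection); the function names and the random projection are placeholders, not the paper's method:

```python
import numpy as np

def concat_meta_embedding(vecs):
    """Concatenate per-source vectors for one word (vecs: list of 1-D arrays)."""
    return np.concatenate(vecs)

def averaged_meta_embedding(vecs, dim=300, seed=0):
    """Average sources after projecting each one to a common dimensionality.

    Random projection is used only to make dimensions compatible; a learned
    projection would normally replace it.
    """
    rng = np.random.default_rng(seed)
    projected = []
    for v in vecs:
        proj = rng.standard_normal((v.shape[0], dim)) / np.sqrt(v.shape[0])
        projected.append(v @ proj)
    return np.mean(projected, axis=0)

# Example: combine hypothetical 300-d GloVe and 768-d BERT vectors for a word.
glove_vec = np.random.rand(300)
bert_vec = np.random.rand(768)
meta_concat = concat_meta_embedding([glove_vec, bert_vec])     # 1068-d
meta_avg = averaged_meta_embedding([glove_vec, bert_vec])      # 300-d
```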
arXiv Detail & Related papers (2020-12-01T16:58:01Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
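A compact sketch of the kind of architecture described here, with a single LSTM encoder feeding two teacher-forced decoders (reconstruction and translation); the layer sizes, the shared initial state, and reading contextualised embeddings off the encoder outputs are assumptions rather than the paper's exact design:

```python
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    """LSTM encoder with two decoders: one rebuilds the source sentence,
    one produces the target-language sentence. Contextualised embeddings
    are taken from the encoder's per-token hidden states."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dec_recon = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dec_trans = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out_recon = nn.Linear(hid_dim, src_vocab)
        self.out_trans = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, src_shifted, tgt_shifted):
        enc_states, (h, c) = self.encoder(self.src_emb(src_ids))
        # Teacher-forced decoding for both objectives, initialised from the
        # encoder's final state.
        recon_states, _ = self.dec_recon(self.src_emb(src_shifted), (h, c))
        trans_states, _ = self.dec_trans(self.tgt_emb(tgt_shifted), (h, c))
        # enc_states are per-token contextualised source embeddings.
        return self.out_recon(recon_states), self.out_trans(trans_states), enc_states
```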
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Attention Word Embedding [23.997145283950346]
We introduce the Attention Word Embedding (AWE) model, which integrates the attention mechanism into the CBOW model.
We also propose AWE-S, which incorporates subword information.
We demonstrate that AWE and AWE-S outperform state-of-the-art word embedding models on a variety of word similarity datasets.
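The summary only states that attention is folded into CBOW, so the following is a generic attention-weighted CBOW context step; the query parameterisation is an assumption, not AWE's exact formulation:

```python
import numpy as np

def attention_cbow_context(context_vecs, center_query):
    """Attention-weighted average of context word vectors.

    Plain CBOW uses a uniform average; here each context word is weighted
    by its softmaxed dot-product score against a query vector associated
    with the centre position.
    """
    scores = context_vecs @ center_query          # (window,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context_vecs                 # (dim,)

# Hypothetical usage: build the hidden vector from a weighted context window.
rng = np.random.default_rng(0)
context = rng.standard_normal((4, 100))   # 4 context words, 100-d vectors
query = rng.standard_normal(100)
hidden = attention_cbow_context(context, query)
# `hidden` would then be scored against output word vectors, as in CBOW.
```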
arXiv Detail & Related papers (2020-06-01T14:47:48Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
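A simplified sketch of graph-based sense induction from static vectors in this spirit: take a word's nearest neighbours, link neighbours that are mutually similar, and read connected components as candidate senses. The threshold-and-components step is a stand-in for the paper's actual algorithm:

```python
import numpy as np

def induce_senses(target, vectors, vocab, k=20, link_threshold=0.5):
    """Group a word's nearest neighbours into rough sense clusters.

    vectors: (V, d) row-normalised embedding matrix; vocab: list of V words.
    """
    idx = vocab.index(target)
    sims = vectors @ vectors[idx]
    neighbours = [i for i in np.argsort(-sims)[: k + 1] if i != idx][:k]

    # Union-find over neighbours linked by high mutual similarity.
    parent = {i: i for i in neighbours}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in neighbours:
        for b in neighbours:
            if a < b and vectors[a] @ vectors[b] >= link_threshold:
                parent[find(a)] = find(b)

    senses = {}
    for i in neighbours:
        senses.setdefault(find(i), []).append(vocab[i])
    return list(senses.values())
```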
arXiv Detail & Related papers (2020-03-14T14:50:04Z)