Deconstructing word embedding algorithms
- URL: http://arxiv.org/abs/2011.07013v1
- Date: Thu, 12 Nov 2020 14:23:35 GMT
- Title: Deconstructing word embedding algorithms
- Authors: Kian Kenyon-Dean, Edward Newell, Jackie Chi Kit Cheung
- Abstract summary: We propose a retrospective on some of the most well-known word embedding algorithms.
In this work, we deconstruct Word2vec, GloVe, and others into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings.
- Score: 17.797952730495453
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings are reliable feature representations of words used to obtain
high quality results for various NLP applications. Uncontextualized word
embeddings are used in many NLP tasks today, especially in resource-limited
settings where high memory capacity and GPUs are not available. Given the
historical success of word embeddings in NLP, we propose a retrospective on
some of the most well-known word embedding algorithms. In this work, we
deconstruct Word2vec, GloVe, and others into a common form, unveiling some of
the common conditions that seem to be required for making performant word
embeddings. We believe that the theoretical findings in this paper can provide
a basis for more informed development of future models.
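As a rough illustration of what such a common form can look like (this is the widely cited matrix-factorization reading of these models, offered as context rather than as the paper's exact derivation), skip-gram with negative sampling (SGNS) and GloVe can both be read as fitting dot products of word and context vectors to corpus co-occurrence statistics:

```latex
% SGNS: per word-context pair (w, c), with k negative contexts drawn from a
% noise distribution P_n
\ell_{\mathrm{SGNS}}(w, c) =
    \log \sigma(\vec{w} \cdot \vec{c})
    + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[ \log \sigma(-\vec{w} \cdot \vec{c}_i) \right]

% At its optimum (Levy & Goldberg, 2014) the dot product recovers a shifted PMI:
\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c) - \log k

% GloVe: weighted least squares against log co-occurrence counts X_{ij}
J_{\mathrm{GloVe}} = \sum_{i, j} f(X_{ij})
    \left( \vec{w}_i \cdot \tilde{\vec{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

In both objectives the geometry of the learned vectors is tied to a pointwise word-context association statistic, which is one concrete sense in which superficially different algorithms can share a form.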
Related papers
- Word Embeddings for Banking Industry [0.0]
Bank-specific word embeddings could be a good stand-alone source or a complement to other widely available embeddings.
This paper explores the idea of creating bank-specific word embeddings and evaluates them against other sources of word embeddings, such as GloVe and BERT.
arXiv Detail & Related papers (2023-06-02T01:00:44Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid word-and-character approaches, as well as subword-based approaches that rely on learned segmentation, have been proposed and evaluated.
We conclude that there is no, and likely never will be, a single silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Fast Extraction of Word Embedding from Q-contexts [17.370344754614518]
We show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus, together with their mutual information with words, one can construct high-quality word embeddings with negligible error.
We present an efficient and effective WEQ method, which is capable of extracting word embedding directly from these typical contexts.
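A minimal sketch of the flavour of this construction, under the assumption that "typical" Q-contexts are approximated here by a random column subset and that PPMI plus a truncated SVD stands in for the paper's WEQ procedure:

```python
import numpy as np

def embeddings_from_few_contexts(cooc, n_contexts=512, dim=100, seed=0):
    """Build word vectors from PPMI scores against a small context subset.

    cooc: (V_words, V_contexts) raw co-occurrence count matrix.
    The subset selection here is plain random sampling; the paper's notion
    of "typical" Q-contexts would replace this step.
    """
    rng = np.random.default_rng(seed)
    cols = rng.choice(cooc.shape[1], size=min(n_contexts, cooc.shape[1]),
                      replace=False)
    sub = cooc[:, cols].astype(float)

    # Positive PMI between each word and each sampled context.
    total = sub.sum()
    p_w = sub.sum(axis=1, keepdims=True) / total
    p_c = sub.sum(axis=0, keepdims=True) / total
    p_wc = sub / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    ppmi = np.nan_to_num(np.maximum(pmi, 0.0))

    # Low-rank factorization of the thin word-by-context matrix.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])
```

The matrix being factorized is only V × Q with Q much smaller than the vocabulary, which is what makes this style of extraction cheap.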
arXiv Detail & Related papers (2021-09-15T05:14:31Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- Obtaining Better Static Word Embeddings Using Contextual Embedding Models [53.86080627007695]
Our proposed distillation method is a simple extension of CBOW-based training.
As a side-effect, our approach also allows a fair comparison of both contextual and static embeddings.
arXiv Detail & Related papers (2021-06-08T12:59:32Z)
- Meta-Embeddings for Natural Language Inference and Semantic Similarity tasks [0.0]
Word Representations form the core component for almost all advanced Natural Language Processing (NLP) applications.
In this paper, we propose to use Meta Embeddings derived from a few State-of-the-Art (SOTA) models to efficiently tackle mainstream NLP tasks.
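The summary does not say how the meta-embedding is assembled, so the sketch below shows two of the simplest standard combination schemes (concatenation, and averaging after a dimension-matching projection); the function names and the random projection are placeholders, not the paper's method:

```python
import numpy as np

def concat_meta_embedding(vecs):
    """Concatenate per-source vectors for one word (vecs: list of 1-D arrays)."""
    return np.concatenate(vecs)

def averaged_meta_embedding(vecs, dim=300, seed=0):
    """Average sources after projecting each one to a common dimensionality.

    Random projection is used only to make dimensions compatible; a learned
    projection would normally replace it.
    """
    rng = np.random.default_rng(seed)
    projected = []
    for v in vecs:
        proj = rng.standard_normal((v.shape[0], dim)) / np.sqrt(v.shape[0])
        projected.append(v @ proj)
    return np.mean(projected, axis=0)

# Example: combine hypothetical 300-d GloVe and 768-d BERT vectors for a word.
glove_vec = np.random.rand(300)
bert_vec = np.random.rand(768)
meta_concat = concat_meta_embedding([glove_vec, bert_vec])     # 1068-d
meta_avg = averaged_meta_embedding([glove_vec, bert_vec])      # 300-d
```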
arXiv Detail & Related papers (2020-12-01T16:58:01Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
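A compact sketch of the kind of architecture described here, with a single LSTM encoder feeding two teacher-forced decoders (reconstruction and translation); the layer sizes, the shared initial state, and reading contextualised embeddings off the encoder outputs are assumptions rather than the paper's exact design:

```python
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    """LSTM encoder with two decoders: one rebuilds the source sentence,
    one produces the target-language sentence. Contextualised embeddings
    are taken from the encoder's per-token hidden states."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dec_recon = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.dec_trans = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out_recon = nn.Linear(hid_dim, src_vocab)
        self.out_trans = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, src_shifted, tgt_shifted):
        enc_states, (h, c) = self.encoder(self.src_emb(src_ids))
        # Teacher-forced decoding for both objectives, initialised from the
        # encoder's final state.
        recon_states, _ = self.dec_recon(self.src_emb(src_shifted), (h, c))
        trans_states, _ = self.dec_trans(self.tgt_emb(tgt_shifted), (h, c))
        # enc_states are per-token contextualised source embeddings.
        return self.out_recon(recon_states), self.out_trans(trans_states), enc_states
```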
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Attention Word Embedding [23.997145283950346]
We introduce the Attention Word Embedding (AWE) model, which integrates the attention mechanism into the CBOW model.
We also propose AWE-S, which incorporates subword information.
We demonstrate that AWE and AWE-S outperform state-of-the-art word embedding models on a variety of word similarity datasets.
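The summary only states that attention is folded into CBOW, so the following is a generic attention-weighted CBOW context step; the query parameterisation is an assumption, not AWE's exact formulation:

```python
import numpy as np

def attention_cbow_context(context_vecs, center_query):
    """Attention-weighted average of context word vectors.

    Plain CBOW uses a uniform average; here each context word is weighted
    by its softmaxed dot-product score against a query vector associated
    with the centre position.
    """
    scores = context_vecs @ center_query          # (window,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context_vecs                 # (dim,)

# Hypothetical usage: build the hidden vector from a weighted context window.
rng = np.random.default_rng(0)
context = rng.standard_normal((4, 100))   # 4 context words, 100-d vectors
query = rng.standard_normal(100)
hidden = attention_cbow_context(context, query)
# `hidden` would then be scored against output word vectors, as in CBOW.
```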
arXiv Detail & Related papers (2020-06-01T14:47:48Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
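A simplified sketch of graph-based sense induction from static vectors in this spirit: take a word's nearest neighbours, link neighbours that are mutually similar, and read connected components as candidate senses. The threshold-and-components step is a stand-in for the paper's actual algorithm:

```python
import numpy as np

def induce_senses(target, vectors, vocab, k=20, link_threshold=0.5):
    """Group a word's nearest neighbours into rough sense clusters.

    vectors: (V, d) row-normalised embedding matrix; vocab: list of V words.
    """
    idx = vocab.index(target)
    sims = vectors @ vectors[idx]
    neighbours = [i for i in np.argsort(-sims)[: k + 1] if i != idx][:k]

    # Union-find over neighbours linked by high mutual similarity.
    parent = {i: i for i in neighbours}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a in neighbours:
        for b in neighbours:
            if a < b and vectors[a] @ vectors[b] >= link_threshold:
                parent[find(a)] = find(b)

    senses = {}
    for i in neighbours:
        senses.setdefault(find(i), []).append(vocab[i])
    return list(senses.values())
```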
arXiv Detail & Related papers (2020-03-14T14:50:04Z)