PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
- URL: http://arxiv.org/abs/2010.10813v1
- Date: Wed, 21 Oct 2020 08:11:08 GMT
- Title: PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
- Authors: Zhao Jinman, Shawn Zhong, Xiaomin Zhang, Yingyu Liang
- Abstract summary: We look into the task of generalizing word embeddings.
Given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words.
We propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embedding.
- Score: 16.531103175919924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We look into the task of \emph{generalizing} word embeddings: given a set of
pre-trained word vectors over a finite vocabulary, the goal is to predict
embedding vectors for out-of-vocabulary words, \emph{without} extra contextual
information. We rely solely on the spellings of words and propose a model,
along with an efficient algorithm, that simultaneously models subword
segmentation and computes subword-based compositional word embedding. We call
the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords
for all possible segmentations based on their likelihood. Inspections and affix
prediction experiments show that PBoS is able to produce meaningful subword
segmentations and subword rankings without any source of explicit morphological
knowledge. Word similarity and POS tagging experiments show clear advantages of
PBoS over previous subword-level models in the quality of generated word
embeddings across languages.
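To make the composition concrete, here is a minimal Python sketch of a PBoS-style embedding for an out-of-vocabulary word: it enumerates all segmentations of the word into known subwords, scores each segmentation by the product of subword probabilities, and sums the bag-of-subwords vectors weighted by those scores. The subword vocabulary, probabilities, and vectors below are toy placeholders, and the sketch omits the paper's training procedure and efficiency tricks.

```python
import numpy as np

# Toy subword vocabulary with placeholder probabilities and embedding vectors.
rng = np.random.default_rng(0)
subword_prob = {"un": 0.05, "like": 0.04, "ly": 0.06, "u": 0.02, "n": 0.02,
                "unlike": 0.01, "likely": 0.02, "l": 0.01, "i": 0.01,
                "k": 0.01, "e": 0.01, "y": 0.01}
subword_vec = {s: rng.normal(size=8) for s in subword_prob}

def segmentations(word):
    """Enumerate every segmentation of `word` into known subwords."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in subword_prob:
            for rest in segmentations(word[end:]):
                yield [prefix] + rest

def pbos_style_embedding(word, dim=8):
    """Likelihood-weighted average of bag-of-subwords vectors over all segmentations."""
    total, weight_sum = np.zeros(dim), 0.0
    for seg in segmentations(word):
        w = np.prod([subword_prob[s] for s in seg])           # segmentation likelihood
        bag = np.sum([subword_vec[s] for s in seg], axis=0)   # bag-of-subwords vector
        total += w * bag
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else total

print(pbos_style_embedding("unlikely")[:4])
```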
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging the semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
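As a rough illustration of how semantically similar subwords might be merged into "semantic tokens", the sketch below greedily groups stand-in subword embeddings by similarity to a group centroid; the subword list, vectors, and threshold are assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
subwords = ["tomato", "tomate", "pomodoro", "cat", "katze", "gato"]
emb = {s: rng.normal(size=16) for s in subwords}  # stand-in mLM subword embeddings

def merge_semantic_tokens(emb, threshold=0.4):
    """Greedily group subwords whose unit-norm vectors have high dot-product
    similarity to a group centroid; each group acts as one 'semantic token'."""
    groups = []  # list of (member subwords, centroid vector)
    for sw, v in emb.items():
        v = v / np.linalg.norm(v)
        for members, centroid in groups:
            if float(v @ centroid) >= threshold:
                members.append(sw)
                vecs = np.stack([emb[m] / np.linalg.norm(emb[m]) for m in members])
                centroid[:] = vecs.mean(axis=0)  # update centroid of the group
                break
        else:
            groups.append(([sw], v.copy()))
    return groups

for members, _ in merge_semantic_tokens(emb):
    print(members)
```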
arXiv Detail & Related papers (2024-11-07T08:38:32Z) - Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities [15.073507986272027]
We argue that there is a confound posed by the most common method of aggregating subword probabilities into word probabilities.
This is due to the fact that tokens in the subword vocabulary of most language models have leading whitespaces.
We present a simple decoding technique that reassigns the probability of the trailing whitespace to that of the current word.
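The sketch below shows the conventional aggregation the paper argues is confounded: subword log-probabilities are summed per whitespace-delimited word, so the whitespace mass is scored with the word it visually precedes. The token strings, the "▁" marker, and the log-probabilities are made up for illustration.

```python
tokens   = ["▁The", "▁cat", "▁sl", "ept", "▁again", "."]
logprobs = [-1.2,   -3.5,   -4.1,  -0.7,  -2.9,     -1.1]

def word_logprobs(tokens, logprobs):
    """Sum subword log-probs per whitespace-delimited word. Because the
    whitespace belongs to the *leading* token of the next word, its
    probability mass is counted with that next word, which is the confound
    the paper describes (its fix reassigns it to the preceding word)."""
    words, scores = [], []
    for tok, lp in zip(tokens, logprobs):
        if tok.startswith("▁") or not words:   # a new word begins here
            words.append(tok.lstrip("▁"))
            scores.append(lp)
        else:                                  # continuation of the current word
            words[-1] += tok
            scores[-1] += lp
    return list(zip(words, scores))

print(word_logprobs(tokens, logprobs))
```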
arXiv Detail & Related papers (2024-06-16T08:44:56Z) - Between words and characters: A Brief History of Open-Vocabulary
Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is not, and likely never will be, a single silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - LexSubCon: Integrating Knowledge from Lexical Resources into Contextual
Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z) - Deriving Word Vectors from Contextualized Language Models using
Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows the basic strategy of summarizing the contexts in which a word is mentioned.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
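A bare-bones sketch of that strategy follows, with a placeholder in place of a real contextualized language model: the vector for a word is the average of per-mention contextual vectors over a set of selected sentences. The `contextual_vector` stub and the example sentences are hypothetical, and the paper's topic-aware mention selection is not modeled here.

```python
import numpy as np

def contextual_vector(sentence, word):
    """Placeholder for a contextualized language model: returns a
    deterministic pseudo-random vector per (sentence, word) mention."""
    seed = abs(hash((sentence, word))) % (2**32)
    return np.random.default_rng(seed).normal(size=32)

def word_vector_from_mentions(word, mentions):
    """Average the contextual vectors of a word over its selected mentions."""
    return np.mean([contextual_vector(s, word) for s in mentions], axis=0)

mentions = ["the bank approved the loan", "she deposited cash at the bank"]
print(word_vector_from_mentions("bank", mentions).shape)
```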
arXiv Detail & Related papers (2021-06-15T08:02:42Z) - Extending Multi-Sense Word Embedding to Phrases and Sentences for
Unsupervised Semantic Applications [34.71597411512625]
We propose a novel embedding method for a text sequence (a phrase or a sentence) where each sequence is represented by a distinct set of codebook embeddings.
Our experiments show that the per-sentence codebook embeddings significantly improve the performances in unsupervised sentence similarity and extractive summarization benchmarks.
arXiv Detail & Related papers (2021-03-29T04:54:28Z) - SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z) - Supervised Understanding of Word Embeddings [1.160208922584163]
We have obtained supervised projections in the form of linear keyword-level classifiers on word embeddings.
We have shown that the method creates interpretable projections of original embedding dimensions.
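A small sketch of the general idea of a supervised projection as a linear keyword-level classifier over word embeddings follows; the keyword lists, the synthetic embeddings, and the use of scikit-learn's LogisticRegression are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
dim = 50
positive = ["good", "great", "excellent", "pleasant", "happy"]
negative = ["bad", "awful", "terrible", "unpleasant", "sad"]
# Synthetic embeddings with a small class offset so the toy task is learnable.
emb = {w: rng.normal(size=dim) + 0.3 for w in positive}
emb.update({w: rng.normal(size=dim) - 0.3 for w in negative})

X = np.stack([emb[w] for w in positive + negative])
y = np.array([1] * len(positive) + [0] * len(negative))

clf = LogisticRegression(max_iter=1000).fit(X, y)
direction = clf.coef_[0]  # an interpretable direction in embedding space
print(float(emb["good"] @ direction), float(emb["awful"] @ direction))
```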
arXiv Detail & Related papers (2020-06-23T20:13:42Z) - On the Learnability of Concepts: With Applications to Comparing Word
Embedding Algorithms [0.0]
We introduce the notion of "concept" as a list of words that have shared semantic content.
We first use this notion to measure the learnability of concepts on pretrained word embeddings.
We then develop a statistical analysis of concept learnability, based on hypothesis testing and ROC curves, in order to compare the relative merits of various embedding algorithms.
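The sketch below illustrates one way such a learnability score could be computed: treat the concept's word list as positives against random words, train a linear classifier on the embeddings with cross-validation, and report ROC AUC. The word lists and embeddings are synthetic, and the paper's hypothesis-testing component is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
dim = 50
concept_words = [f"animal_{i}" for i in range(20)]
other_words = [f"other_{i}" for i in range(20)]
# Synthetic embeddings; concept words share an offset so the toy concept
# is partly learnable from the vectors.
emb = {w: rng.normal(size=dim) + 0.5 for w in concept_words}
emb.update({w: rng.normal(size=dim) for w in other_words})

X = np.stack([emb[w] for w in concept_words + other_words])
y = np.array([1] * len(concept_words) + [0] * len(other_words))

probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print("concept learnability (ROC AUC):", round(roc_auc_score(y, probs), 3))
```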
arXiv Detail & Related papers (2020-06-17T14:25:36Z) - Lexical Sememe Prediction using Dictionary Definitions by Capturing
Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on the well-known sememe knowledge base HowNet and find that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-01-16T17:30:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.