Morphological Skip-Gram: Using morphological knowledge to improve word
representation
- URL: http://arxiv.org/abs/2007.10055v2
- Date: Tue, 21 Jul 2020 09:01:52 GMT
- Title: Morphological Skip-Gram: Using morphological knowledge to improve word
representation
- Authors: Flávio Santos, Hendrik Macedo, Thiago Bispo, Cleber Zanchettin
- Abstract summary: We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show competitive performance compared to FastText.
- Score: 2.0129974477913457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language processing models have attracted much interest in the deep
learning community. This branch of study is composed of some applications such
as machine translation, sentiment analysis, named entity recognition, question
answering, and others. Word embeddings are continuous word representations;
they are an essential module for these applications and are generally used as
the input word representation for deep learning models. Word2Vec and GloVe are
two popular methods for learning word embeddings. They achieve good word
representations; however, they learn representations with limited information
because they ignore the morphological information of words and consider
only one representation vector for each word. This approach implies that
Word2Vec and GloVe are unaware of a word's inner structure. To mitigate this
problem, the FastText model represents each word as a bag of character
n-grams. Hence, each n-gram has a continuous vector representation, and the
final word representation is the sum of its character n-gram vectors.
Nevertheless, using all character n-grams of a word is a weak approach,
since some n-grams have no semantic relation to their word and increase the
amount of potentially useless information. This approach also increases
training time. In this work, we propose a new method for training word
embeddings whose goal is to replace the FastText bag of character n-grams
with a bag of word morphemes obtained through morphological analysis of the word.
Thus, words with similar context and morphemes are represented by vectors close
to each other. To evaluate our new approach, we performed intrinsic evaluations
considering 15 different tasks, and the results show competitive performance
compared to FastText.
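To make the contrast concrete, the following is a minimal, self-contained sketch (our illustration, not code from the paper) of the two composition schemes described above: a FastText-style sum over character n-gram vectors versus the proposed sum over morpheme vectors. The embedding dimensionality, the lazy lookup tables, and the hand-written segmentation of "unhappiness" are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimensionality, illustrative only

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams of a word wrapped in boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# Hypothetical lookup tables; in a real model these vectors are trained
# with the skip-gram objective rather than drawn at random.
ngram_vecs = {}     # character n-gram -> vector (FastText)
morpheme_vecs = {}  # morpheme         -> vector (Morphological Skip-Gram)

def vec(table, key):
    """Fetch (or lazily create) the vector of a sub-word unit."""
    if key not in table:
        table[key] = rng.normal(size=DIM)
    return table[key]

def fasttext_word_vector(word):
    """Word vector = sum of the vectors of all its character n-grams."""
    return sum(vec(ngram_vecs, g) for g in char_ngrams(word))

def morphological_word_vector(morphemes):
    """Word vector = sum of its morpheme vectors; the segmentation comes
    from an external morphological analyzer (hand-written here)."""
    return sum(vec(morpheme_vecs, m) for m in morphemes)

print(len(char_ngrams("unhappiness")))                           # 38 n-grams for one word
print(fasttext_word_vector("unhappiness").shape)                 # (8,)
print(morphological_word_vector(["un", "happi", "ness"]).shape)  # (8,) from only 3 units
```

Even this toy shows the trade-off the abstract argues for: the n-gram table grows with every substring of every word, many of which carry no semantic relation to the word, while the morpheme table stays small and each entry keeps a semantic link to the words it composes.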
Related papers
- From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general-purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT but takes up only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows this basic strategy.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
- WOVe: Incorporating Word Order in GloVe Word Embeddings [0.0]
Defining a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it.
Word vector representations have been used in many applications such as word synonyms, word analogy, syntactic parsing, and many others.
arXiv Detail & Related papers (2021-05-18T15:28:20Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation and also use different word embeddings as features of news articles for extrinsic evaluation.
arXiv Detail & Related papers (2020-10-26T08:00:48Z)
- Embedding Words in Non-Vector Space with Unsupervised Graph Learning [33.51809615505692]
We introduce GraphGlove: unsupervised graph word representations which are learned end-to-end.
In our setting, each word is a node in a weighted graph and the distance between words is the shortest path distance between the corresponding nodes.
We show that our graph-based representations substantially outperform vector-based methods on word similarity and analogy tasks (a toy sketch of this graph distance follows the list below).
arXiv Detail & Related papers (2020-10-06T10:17:49Z)
- Using Holographically Compressed Embeddings in Question Answering [0.0]
This research employs holographic compression of pre-trained embeddings to represent a token, its part-of-speech, and named entity type.
The implementation, in a modified question-answering recurrent deep learning network, shows that semantic relationships are preserved and yields strong performance (an illustrative sketch of holographic binding appears after this list).
arXiv Detail & Related papers (2020-07-14T18:29:49Z)
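The graph-distance sketch promised in the GraphGlove entry above: a toy illustration of the non-vector representation it describes, where each word is a node in a weighted graph and the distance between two words is the weighted shortest-path length between their nodes. The vocabulary, topology, and edge weights below are invented for illustration; in the paper they are learned end-to-end.

```python
import networkx as nx

# Toy weighted word graph; GraphGlove learns the topology and weights
# end-to-end, here they are hand-picked purely for illustration.
G = nx.Graph()
G.add_edge("king", "queen", weight=0.3)
G.add_edge("king", "man", weight=0.5)
G.add_edge("queen", "woman", weight=0.5)
G.add_edge("man", "woman", weight=0.4)
G.add_edge("man", "banana", weight=2.0)

def word_distance(a, b):
    """Distance between two words = weighted shortest-path length between their nodes."""
    return nx.shortest_path_length(G, a, b, weight="weight")

print(word_distance("king", "queen"))    # 0.3
print(word_distance("queen", "banana"))  # 2.8 via queen -> king -> man -> banana
```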
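For the holographically compressed embeddings entry, the sketch below shows the binding operation commonly used in holographic reduced representations, circular convolution computed via the FFT, under the assumption that this is the sense of "holographic compression" meant there. The role vectors and the token/POS/NE filler vectors are random placeholders, not the paper's pre-trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 256  # toy dimensionality

def bind(a, b):
    """Circular convolution, the standard binding operation of holographic
    reduced representations, computed via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

# Random placeholder vectors for the three fields and their role keys.
roles = {name: rng.normal(0, 1 / np.sqrt(D), D) for name in ("token", "pos", "ner")}
fillers = {name: rng.normal(0, 1 / np.sqrt(D), D) for name in ("token", "pos", "ner")}

# Bind each filler to its role and superpose: one fixed-size vector now
# carries the token embedding, its part-of-speech, and its entity type.
compressed = sum(bind(roles[name], fillers[name]) for name in roles)
print(compressed.shape)  # (256,) -- same size as a single embedding
```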
This list is automatically generated from the titles and abstracts of the papers on this site.