Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model
- URL: http://arxiv.org/abs/2010.13404v3
- Date: Mon, 3 May 2021 20:58:27 GMT
- Title: Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model
- Authors: Rifat Rahman
- Abstract summary: We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation, and we also use different word embeddings as features of a news article classifier for extrinsic evaluation.
- Score: 1.2691047660244335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word embedding, or the vector representation of a word, holds the
syntactic and semantic characteristics of that word, which can be an
informative feature for any machine learning-based natural language
processing model. There are several deep learning-based models for the
vectorization of words, such as word2vec, fastText, and GloVe, with
implementations available in toolkits like gensim. In this study, we analyze
the word2vec model for learning word vectors by tuning different
hyper-parameters and present the most effective word embedding for the Bangla
language. To test the performance of the different word embeddings generated
by fine-tuning the word2vec model, we perform both intrinsic and extrinsic
evaluations. We cluster the word vectors to examine the relational similarity
of words for the intrinsic evaluation, and we use the different word
embeddings as features of a news article classifier for the extrinsic
evaluation. From our experiments, we find that 300-dimensional word vectors
generated by the skip-gram method of the word2vec model with a sliding window
size of 4 give the most robust vector representations for the Bangla
language.
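The reported configuration maps directly onto the gensim toolkit mentioned in
the abstract. The following is a minimal sketch, not the authors' released
code: the corpus file name, the frequency cutoff, the use of k-means for the
clustering step, and mean-pooled document features for the classifier are all
assumptions layered on top of what the abstract actually states (skip-gram,
300 dimensions, window size 4).

```python
# Minimal sketch of the reported best configuration, assuming gensim >= 4.
# "bangla_corpus.txt" is a hypothetical pre-tokenized corpus (one sentence
# per line); the paper's actual corpus is not distributed with this summary.
import numpy as np
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("bangla_corpus.txt")  # hypothetical corpus file

model = Word2Vec(
    sentences,
    vector_size=300,  # 300-dimensional vectors, per the abstract
    window=4,         # sliding window size of 4, per the abstract
    sg=1,             # sg=1 selects skip-gram (sg=0 would be CBOW)
    min_count=5,      # frequency cutoff; an assumption, not stated in the paper
    workers=4,
)

# Intrinsic evaluation: cluster the vocabulary vectors and inspect whether
# semantically related words share a cluster. The abstract says the vectors
# are clustered; the choice of k-means here is an assumption.
from sklearn.cluster import KMeans

words = model.wv.index_to_key
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(model.wv[words])

# Extrinsic evaluation: use the embeddings as features of a news article
# classifier. Mean-pooling the word vectors of a document is one common
# choice; the paper does not specify its aggregation scheme.
def doc_vector(tokens):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```

Feeding doc_vector features into any standard linear classifier (for example,
scikit-learn's LogisticRegression) would complete the extrinsic evaluation
loop the abstract describes.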
Related papers
- Backpack Language Models [108.65930795825416]
We present Backpacks, a new neural architecture that marries strong modeling performance with an interface for interpretability and control.
We find that, after training, sense vectors specialize, each encoding a different aspect of a word.
We present simple algorithms that intervene on sense vectors to perform controllable text generation and debiasing.
arXiv Detail & Related papers (2023-05-26T09:26:23Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Tsetlin Machine Embedding: Representing Words Using Logical Expressions [10.825099126920028]
We introduce a Tsetlin Machine-based autoencoder that learns logical clauses self-supervised.
The clauses consist of contextual words like "black," "cup," and "hot" to define other words like "coffee".
We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GloVe on six classification tasks.
arXiv Detail & Related papers (2023-01-02T15:02:45Z)
- Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT but takes up only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z)
- Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows this basic strategy.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
- WOVe: Incorporating Word Order in GloVe Word Embeddings [0.0]
Defining a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it.
Word vector representations have been used in many applications, such as word synonyms, word analogies, syntactic parsing, and many others.
arXiv Detail & Related papers (2021-05-18T15:28:20Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Morphological Skip-Gram: Using morphological knowledge to improve word representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
- Attention Word Embedding [23.997145283950346]
We introduce the Attention Word Embedding (AWE) model, which integrates the attention mechanism into the CBOW model.
We also propose AWE-S, which incorporates subword information.
We demonstrate that AWE and AWE-S outperform state-of-the-art word embedding models on a variety of word similarity datasets.
arXiv Detail & Related papers (2020-06-01T14:47:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.