A Rule-based/BPSO Approach to Produce Low-dimensional Semantic Basis
Vectors Set
- URL: http://arxiv.org/abs/2111.12802v1
- Date: Wed, 24 Nov 2021 21:23:43 GMT
- Title: A Rule-based/BPSO Approach to Produce Low-dimensional Semantic Basis
Vectors Set
- Authors: Atefe Pakzad, Morteza Analoui
- Abstract summary: In explicit semantic vectors, each dimension corresponds to a word, so word vectors are interpretable.
In this research, we propose a new approach to obtain low-dimensional explicit semantic vectors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We intend to generate low-dimensional explicit distributional semantic
vectors. In explicit semantic vectors, each dimension corresponds to a word, so
word vectors are interpretable. In this research, we propose a new approach to
obtain low-dimensional explicit semantic vectors. First, the proposed approach
considers the three criteria Word Similarity, Number of Zeros, and Word
Frequency as features for the words in a corpus. Then, we extract rules for
obtaining the initial basis words using a decision tree built over the three
features. Second, we propose a binary weighting method based on
the Binary Particle Swarm Optimization algorithm that obtains N_B = 1000
context words. We also use a word selection method that provides N_S = 1000
context words. Third, we extract the golden words of the corpus based on the
binary weighting method. Then, we add the extracted golden words to the
context words selected by the word selection method to form the golden
context words. We use the ukWaC corpus to construct the word vectors. We use MEN,
RG-65, and SimLex-999 test sets to evaluate the word vectors. We report the
results compared to a baseline that uses the 5k most frequent words in the
corpus as context words. The baseline method uses a fixed window to count
co-occurrences. We obtain the word vectors using the 1000 selected context
words together with the golden context words. Compared to the baseline
method, our approach increases the Spearman correlation coefficient on the
MEN, RG-65, and SimLex-999 test sets by 4.66%, 14.73%, and 1.08%,
respectively.
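To make the selection step concrete, below is a minimal sketch of Binary PSO over a fixed-window co-occurrence matrix. The toy corpus, similarity pairs, gold scores, and Spearman-based fitness are illustrative assumptions, not the paper's actual setup (which uses ukWaC, N_B = 1000 context words, and its own fitness function).

```python
import numpy as np
from scipy.stats import spearmanr

def cooc_matrix(tokens, idx, window=2):
    """Fixed-window co-occurrence counts (the baseline scheme above)."""
    M = np.zeros((len(idx), len(idx)))
    for i, w in enumerate(tokens):
        if w not in idx:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in idx:
                M[idx[w], idx[tokens[j]]] += 1
    return M

def fitness(mask, M, pairs, gold, idx):
    """Spearman correlation between cosine similarities computed on the
    selected context-word columns and gold similarity judgments."""
    if mask.sum() == 0:
        return -1.0
    V = M[:, mask.astype(bool)]
    sims = []
    for a, b in pairs:
        va, vb = V[idx[a]], V[idx[b]]
        d = np.linalg.norm(va) * np.linalg.norm(vb)
        sims.append(va @ vb / d if d else 0.0)
    rho, _ = spearmanr(sims, gold)
    return -1.0 if np.isnan(rho) else rho

def bpso(M, pairs, gold, idx, n_particles=10, iters=30, seed=0):
    """Binary PSO: the sigmoid of a real-valued velocity gives the
    probability that each context word is switched on in a binary mask."""
    rng = np.random.default_rng(seed)
    d = M.shape[1]
    X = rng.integers(0, 2, (n_particles, d))   # binary positions
    V = rng.normal(0.0, 1.0, (n_particles, d)) # real-valued velocities
    pbest = X.copy()
    pbest_f = np.array([fitness(x, M, pairs, gold, idx) for x in X])
    gbest = pbest[pbest_f.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((n_particles, d)), rng.random((n_particles, d))
        V = 0.7 * V + 1.5 * r1 * (pbest - X) + 1.5 * r2 * (gbest - X)
        X = (rng.random((n_particles, d)) < 1.0 / (1.0 + np.exp(-V))).astype(int)
        f = np.array([fitness(x, M, pairs, gold, idx) for x in X])
        better = f > pbest_f
        pbest[better], pbest_f[better] = X[better], f[better]
        gbest = pbest[pbest_f.argmax()].copy()
    return gbest  # binary mask over candidate context words

# Toy usage with hypothetical data.
tokens = "the cat sat on the mat while the dog slept on the rug".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
M = cooc_matrix(tokens, idx)
pairs = [("cat", "dog"), ("mat", "rug"), ("cat", "rug")]  # hypothetical dev pairs
mask = bpso(M, pairs, [0.8, 0.7, 0.2], idx)
print("selected context words:", [vocab[i] for i in np.flatnonzero(mask)])
```

The sigmoid-of-velocity position update is the standard BPSO rule; swapping in a real corpus and the paper's fitness criterion would reuse the same loop unchanged.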
Related papers
- Contextualized Word Vector-based Methods for Discovering Semantic
Differences with No Training nor Word Alignment [17.229611956178818]
We propose methods for discovering semantic differences in words appearing in two corpora.
The key idea is that the coverage of a word's meanings is reflected in the norm of its mean word vector.
We show these advantages for native and non-native English corpora and also for historical corpora.
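As a quick illustration of that idea, here is a hedged sketch in which toy vectors stand in for contextualized occurrence embeddings; the geometry, not the paper's actual pipeline, is the point.

```python
import numpy as np

def mean_norm(occurrence_vectors):
    """Normalize each occurrence so only direction matters, then measure
    how strongly the directions agree via the norm of their mean."""
    V = np.asarray(occurrence_vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return np.linalg.norm(V.mean(axis=0))

rng = np.random.default_rng(0)
focused = rng.normal(0, 0.1, (50, 8)) + 1.0  # occurrences of one dominant sense
spread = rng.normal(0, 1.0, (50, 8))         # occurrences pointing everywhere
print(mean_norm(focused))  # near 1: meanings overlap, coverage is narrow
print(mean_norm(spread))   # near 0: directions cancel, coverage is broad
```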
arXiv Detail & Related papers (2023-05-19T08:27:17Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage
Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
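A minimal sketch of the neighbor-based alternative, under the assumption that both corpora share a vocabulary indexed the same way; the toy embedding matrices are placeholders for independently trained corpus-specific embeddings.

```python
import numpy as np

def top_k_neighbors(E, i, k=5):
    """Cosine nearest neighbors of word i inside a single embedding space."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    order = [j for j in np.argsort(-(En @ En[i])) if j != i]
    return set(order[:k])

def usage_change_score(E1, E2, i, k=5):
    """1 - Jaccard overlap of a word's neighbor sets across the two spaces:
    a high score means the word keeps different company in the two corpora."""
    n1, n2 = top_k_neighbors(E1, i, k), top_k_neighbors(E2, i, k)
    return 1.0 - len(n1 & n2) / len(n1 | n2)

rng = np.random.default_rng(1)
E1 = rng.normal(size=(100, 16))          # embeddings trained on corpus A
E2 = E1 + rng.normal(0, 0.05, E1.shape)  # corpus B: mostly unchanged usage
E2[7] = rng.normal(size=16)              # word 7 changed its usage
print(usage_change_score(E1, E2, 7))     # near 1.0: usage changed
print(usage_change_score(E1, E2, 3))     # near 0.0: usage stable
```

No alignment of the two vector spaces is needed, since each neighbor set is computed entirely within its own space.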
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
- Semantic-Preserving Adversarial Text Attacks [85.32186121859321]
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Our method achieves the highest attack success and semantic preservation rates while changing the smallest number of words compared with existing methods.
arXiv Detail & Related papers (2021-08-23T09:05:18Z)
- Deriving Word Vectors from Contextualized Language Models using
Topic-Aware Mention Selection [46.97185212695267]
We propose a method for learning word representations that follows the basic strategy of deriving a word's vector from the contexts in which it is mentioned.
We take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts.
We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
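A simplified sketch of the general recipe (not the paper's topic-aware mention selection): average a word's contextualized vectors over a few mentions to obtain a static vector. The model choice and the mention set are assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def static_vector(word, mentions):
    """Average the contextualized embeddings of `word` across its mentions
    (assumes the word maps to a single wordpiece in the vocabulary)."""
    wid = tok.convert_tokens_to_ids(word)
    vecs = []
    for sent in mentions:
        enc = tok(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        for pos, tid in enumerate(enc["input_ids"][0]):
            if tid.item() == wid:
                vecs.append(hidden[pos])
    return torch.stack(vecs).mean(dim=0)

v = static_vector("bank", ["she sat by the bank of the river",
                           "the bank raised its interest rates"])
print(v.shape)  # torch.Size([768])
```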
arXiv Detail & Related papers (2021-06-15T08:02:42Z)
- WOVe: Incorporating Word Order in GloVe Word Embeddings [0.0]
Representing a word as a vector makes it easy for machine learning algorithms to understand a text and extract information from it.
Word vector representations have been used in many applications such as word synonyms, word analogy, syntactic parsing, and many others.
arXiv Detail & Related papers (2021-05-18T15:28:20Z)
- An Iterative Contextualization Algorithm with Second-Order Attention [0.40611352512781856]
We show how to combine the representations of words that make up a sentence into a cohesive whole.
Our algorithm starts with a presumably erroneous value of the context, and adjusts this value with respect to the tokens at hand.
Our models report strong results in several well-known text classification tasks.
arXiv Detail & Related papers (2021-03-03T05:34:50Z)
- SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z)
- Robust and Consistent Estimation of Word Embedding for Bangla Language
by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation and also use different word embeddings as the feature of news article for extrinsic evaluation.
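A hedged sketch of that evaluation recipe with gensim and scikit-learn; the toy English corpus and all hyperparameters are placeholders for the paper's Bangla setup.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["dhaka", "is", "the", "capital", "of", "bangladesh"],
    ["rice", "and", "fish", "are", "staple", "foods"],
    ["the", "river", "flows", "through", "the", "delta"],
] * 50  # repeat so every word in the toy corpus clears min_count

# Train skip-gram word2vec, then cluster the vectors to inspect
# which words land together (a rough intrinsic evaluation).
model = Word2Vec(sentences, vector_size=32, window=3, min_count=5,
                 sg=1, epochs=20)
words = list(model.wv.index_to_key)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(model.wv[words])
for w, c in sorted(zip(words, labels), key=lambda t: t[1]):
    print(c, w)
```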
arXiv Detail & Related papers (2020-10-26T08:00:48Z)
- Word Rotator's Distance [50.67809662270474]
A key principle in assessing textual similarity is to measure the degree of semantic overlap between two texts while taking word alignment into account.
We show that the norm of word vectors is a good proxy for word importance, and their angle is a good proxy for word similarity.
We propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity.
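A sketch of the decoupling idea: norms become importance weights and directions carry similarity. The greedy best-match alignment below is a simplification; the paper itself formulates the alignment as optimal transport.

```python
import numpy as np

def decouple(vectors):
    """Split word vectors into importance weights (normalized norms)
    and unit-length directions."""
    V = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(V, axis=1)
    return norms / norms.sum(), V / norms[:, None]

def alignment_similarity(A, B):
    wa, da = decouple(A)
    wb, db = decouple(B)
    sims = da @ db.T  # cosine similarity of every direction pair
    # Each word contributes its best alignment, weighted by importance.
    return 0.5 * (wa @ sims.max(axis=1) + wb @ sims.max(axis=0))

rng = np.random.default_rng(2)
text1 = rng.normal(size=(4, 16))   # toy word vectors for text 1
text2 = text1[[1, 0, 3, 2]] * 1.5  # same words, reordered and rescaled
print(alignment_similarity(text1, text2))  # 1.0: directions match exactly
```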
arXiv Detail & Related papers (2020-04-30T17:48:42Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing
Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on the well-known sememe knowledge base HowNet and find that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-01-16T17:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.