Moving Other Way: Exploring Word Mover Distance Extensions
- URL: http://arxiv.org/abs/2202.03119v2
- Date: Tue, 8 Feb 2022 16:33:15 GMT
- Title: Moving Other Way: Exploring Word Mover Distance Extensions
- Authors: Ilya Smirnov, Ivan P. Yamshchikov
- Abstract summary: The word mover's distance (WMD) is a popular semantic similarity metric for two texts.
This paper studies several possible extensions of WMD.
- Score: 7.195824023358536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The word mover's distance (WMD) is a popular semantic similarity metric for
two texts. This position paper studies several possible extensions of WMD. We
experiment with the frequency of words in the corpus as a weighting factor and
the geometry of the word vector space. We validate possible extensions of WMD
on six document classification datasets. Some proposed extensions show better
results in terms of the k-nearest neighbor classification error than WMD.
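At its core, WMD casts document distance as an optimal-transport linear program over word embeddings: each document is a normalized bag-of-words distribution, and the distance is the minimum cost of moving one distribution onto the other, with pairwise embedding distances as the cost. A minimal sketch with toy 2-d vectors, using SciPy's generic LP solver as a stand-in for the specialized EMD solvers used in practice:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X, d1, d2):
    """Word mover's distance as a transport linear program.

    X  : (n, dim) word embedding matrix for the shared vocabulary
    d1 : (n,) nBOW weights of document 1 (non-negative, sums to 1)
    d2 : (n,) nBOW weights of document 2 (non-negative, sums to 1)
    """
    n = len(d1)
    # Euclidean cost between every pair of word vectors
    C = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Variables: flattened transport matrix T (n*n entries).
    # Equality constraints: row sums of T equal d1, column sums equal d2.
    A_eq = []
    for i in range(n):
        row = np.zeros((n, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):
        col = np.zeros((n, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([d1, d2]),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy example: all mass of document 1 sits on word 0,
# all mass of document 2 sits on word 1.
X = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0])
print(wmd(X, d1, d2))  # ~5.0: moving the mass costs ||(3, 4)|| = 5
```

The extensions studied in the paper plug into this formulation by reweighting the nBOW distributions (e.g. by corpus frequency) or by changing the cost matrix `C` (the geometry of the embedding space).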
Related papers
- Improving word mover's distance by leveraging self-attention matrix [7.934452214142754]
The proposed method is based on the Fused Gromov-Wasserstein distance, which simultaneously considers the similarity of the word embeddings and the self-attention matrix (SAM) when calculating the optimal transport between two sentences.
Experiments demonstrate the proposed method enhances WMD and its variants in paraphrase identification with near-equivalent performance in semantic textual similarity.
arXiv Detail & Related papers (2022-11-11T14:25:08Z) - SynWMD: Syntax-aware Word Mover's Distance for Sentence Similarity Evaluation [36.5590780726458]
Word Mover's Distance (WMD) computes the distance between words and models text similarity with the moving cost between words in two text sequences.
An improved WMD method using the syntactic parse tree, called Syntax-aware Word Mover's Distance (SynWMD), is proposed in this work to address shortcomings of the original WMD.
arXiv Detail & Related papers (2022-06-20T22:30:07Z) - Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
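The neighbor-based idea above is straightforward to operationalize: for each word, compare its top-k nearest-neighbor sets in the two corpora and score the disagreement. A minimal sketch with hypothetical toy embeddings (the function names and the overlap-based score are illustrative, not the paper's exact formulation):

```python
import numpy as np

def top_k_neighbors(word_idx, emb, k):
    """Indices of the k nearest words by cosine similarity (self excluded)."""
    v = emb[word_idx]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v))
    sims[word_idx] = -np.inf  # never count the word as its own neighbor
    return set(np.argsort(-sims)[:k])

def usage_change_score(word_idx, emb_a, emb_b, k=2):
    """Fraction of the word's top-k neighbors that differ between corpora:
    0 = identical neighborhoods, 1 = completely disjoint."""
    na = top_k_neighbors(word_idx, emb_a, k)
    nb = top_k_neighbors(word_idx, emb_b, k)
    return 1.0 - len(na & nb) / k

# Toy vocabulary of 4 words; in corpus B, word 0 has drifted toward word 3.
emb_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
emb_b = emb_a.copy()
emb_b[0] = [0.0, 1.0]
print(usage_change_score(0, emb_a, emb_b, k=2))  # 0.5: one of two neighbors changed
```

Because only neighbor identities are compared, no alignment of the two vector spaces is needed, which is exactly the point of the approach.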
arXiv Detail & Related papers (2021-12-28T23:46:00Z) - Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions on their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
arXiv Detail & Related papers (2021-10-04T03:59:15Z) - Re-evaluating Word Mover's Distance [42.922307642413244]
The original study on word mover's distance (WMD) reported that WMD outperforms classical baselines.
We re-evaluate the performances of WMD and the classical baselines.
We find that WMD in high-dimensional spaces behaves more similarly to BOW than in low-dimensional spaces due to the curse of dimensionality.
arXiv Detail & Related papers (2021-05-30T01:35:03Z) - SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z) - SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z) - Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel method for the automatic extraction of domain-specific words, called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z) - Hybrid Improved Document-level Embedding (HIDE) [5.33024001730262]
We propose HIDE, a Hybrid Improved Document-level Embedding.
It incorporates domain information, parts of speech information and sentiment information into existing word embeddings such as GloVe and Word2Vec.
We show considerable improvement over the accuracy of existing pretrained word vectors such as GloVe and Word2Vec.
arXiv Detail & Related papers (2020-06-01T19:09:13Z) - Text classification with word embedding regularization and soft similarity measure [0.20999222360659603]
Two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance.
We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings.
We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.
arXiv Detail & Related papers (2020-03-10T22:07:34Z)
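The SCM in the last entry is the soft cosine measure, which generalizes cosine similarity with a word-to-word similarity matrix S (setting S to the identity recovers the ordinary cosine). A minimal numpy sketch with a hypothetical two-word similarity matrix:

```python
import numpy as np

def soft_cosine(x, y, S):
    """Soft cosine measure between BOW vectors x and y under a
    word-to-word similarity matrix S (S = I gives the ordinary cosine)."""
    return (x @ S @ y) / np.sqrt((x @ S @ x) * (y @ S @ y))

# Hypothetical 2-word vocabulary where the two words are 50% similar
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
x = np.array([1.0, 0.0])  # BOW vector of document 1
y = np.array([0.0, 1.0])  # BOW vector of document 2
print(soft_cosine(x, y, S))  # 0.5 despite zero word overlap
```

Unlike WMD, this is a closed-form bilinear expression rather than an optimization problem, which is why the paper can report it running orders of magnitude faster.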
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.