Simple, Interpretable and Stable Method for Detecting Words with Usage
Change across Corpora
- URL: http://arxiv.org/abs/2112.14330v1
- Date: Tue, 28 Dec 2021 23:46:00 GMT
- Authors: Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The problem of comparing two bodies of text and searching for words that
differ in their usage between them arises often in digital humanities and
computational social science. This is commonly approached by training word
embeddings on each corpus, aligning the vector spaces, and looking for words
whose cosine distance in the aligned space is large. However, these methods
often require extensive filtering of the vocabulary to perform well and, as
we show in this work, produce unstable and hence less reliable results. We
propose an alternative approach that does not use vector space alignment, and
instead considers the neighbors of each word. The method is simple,
interpretable and stable. We demonstrate its effectiveness in 9 different
setups, considering different corpus splitting criteria (age, gender and
profession of tweet authors, time of tweet) and different languages (English,
French and Hebrew).
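The neighbor-based idea lends itself to a short sketch. The following is a minimal toy illustration with numpy vectors, not the paper's exact formulation; the function names and the overlap-based score are my own simplification:

```python
import numpy as np

def top_k_neighbors(word, emb, k):
    """Top-k nearest neighbors of `word` by cosine similarity."""
    v = emb[word] / np.linalg.norm(emb[word])
    sims = {w: float(v @ (u / np.linalg.norm(u)))
            for w, u in emb.items() if w != word}
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def usage_change_score(word, emb_a, emb_b, k=2):
    """Fraction of top-k neighbors that differ between the two corpora.
    0.0 = identical neighborhoods, 1.0 = completely different usage."""
    return 1.0 - len(top_k_neighbors(word, emb_a, k)
                     & top_k_neighbors(word, emb_b, k)) / k
```

Because the score only compares neighbor sets, no alignment of the two embedding spaces is needed, and the shared/unshared neighbors themselves explain why a word was flagged.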
Related papers
- Unsupervised extraction of local and global keywords from a single text [0.0]
We propose an unsupervised, corpus-independent method to extract keywords from a single text.
It is based on the spatial distribution of words and the response of this distribution to a random permutation of words.
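The permutation idea can be illustrated with a toy score. This is my own simplification, not the paper's statistic: compare the spread of a word's inter-occurrence gaps in the real text with its average spread under random shuffles; a ratio far from 1 signals non-random placement.

```python
import random
import statistics

def gap_spread(tokens, word):
    """Standard deviation of gaps between consecutive occurrences."""
    pos = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return statistics.stdev(gaps) if len(gaps) > 1 else 0.0

def permutation_score(tokens, word, n_shuffles=200, seed=0):
    """Ratio of the real gap spread to its mean over random permutations.
    Values far from 1 mean the word's placement is not random."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n_shuffles):
        shuffled = list(tokens)
        rng.shuffle(shuffled)
        baseline.append(gap_spread(shuffled, word))
    return gap_spread(tokens, word) / (sum(baseline) / n_shuffles)
```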
arXiv Detail & Related papers (2023-07-26T07:36:25Z)
- Contextualized Word Vector-based Methods for Discovering Semantic Differences with No Training nor Word Alignment [17.229611956178818]
We propose methods for discovering semantic differences in words appearing in two corpora.
The key idea is that the coverage of a word's meanings is reflected in the norm of its mean contextualized word vector.
We show these advantages for native and non-native English corpora and also for historical corpora.
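That norm intuition admits a tiny sketch, assuming each occurrence of a word has a contextualized vector (the helper name and toy vectors below are mine, not the paper's):

```python
import numpy as np

def meaning_coverage_norm(occurrence_vectors):
    """Norm of the mean of a word's contextualized occurrence vectors.
    Occurrences with diverse senses point in different directions, so
    the mean vector (and its norm) shrinks; a monosemous word stays ~1."""
    unit = np.stack([v / np.linalg.norm(v) for v in occurrence_vectors])
    return float(np.linalg.norm(unit.mean(axis=0)))
```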
arXiv Detail & Related papers (2023-05-19T08:27:17Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains [51.91456788949489]
We propose a novel matching method tailored for text matching in asymmetrical domains, called WD-Match.
In WD-Match, a Wasserstein distance-based regularizer is defined to regularize the feature vectors projected from different domains.
The training process of WD-Match amounts to a game that minimizes the matching loss regularized by the Wasserstein distance.
arXiv Detail & Related papers (2020-10-15T12:52:09Z)
- Comparative Analysis of Word Embeddings for Capturing Word Similarities [0.0]
Distributed representations have become the most widely used technique for representing language in various natural language processing tasks.
Most natural language processing models based on deep learning use pre-trained distributed word representations, commonly called word embeddings.
Selecting appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans.
arXiv Detail & Related papers (2020-05-08T01:16:03Z)
- Word Rotator's Distance [50.67809662270474]
A key principle in assessing textual similarity is measuring the degree of semantic overlap between two texts while taking word alignment into account.
We show that the norm of word vectors is a good proxy for word importance, and their angle is a good proxy for word similarity.
We propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity.
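The decoupling step can be sketched as follows. This is a toy version in which a brute-force search over permutations of two equally long token sequences stands in for the paper's optimal-transport alignment; all names here are illustrative:

```python
import itertools
import numpy as np

def decouple(vectors):
    """Split each word vector into its norm (importance) and direction."""
    norms = np.array([np.linalg.norm(v) for v in vectors])
    dirs = np.stack([v / n for v, n in zip(vectors, norms)])
    return norms, dirs

def aligned_similarity(vecs_a, vecs_b):
    """Norm-weighted average cosine over the best one-to-one alignment."""
    na, da = decouple(vecs_a)
    nb, db = decouple(vecs_b)
    weight = np.outer(na, nb)   # importance of each candidate pairing
    cosine = da @ db.T          # direction (meaning) similarity
    best = -1.0
    for p in itertools.permutations(range(len(vecs_b))):
        w = np.array([weight[i, j] for i, j in enumerate(p)])
        c = np.array([cosine[i, j] for i, j in enumerate(p)])
        best = max(best, float((w * c).sum() / w.sum()))
    return best
```

Because norms only act as weights, a long but unimportant word contributes little, while direction alone determines how well two words match.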
arXiv Detail & Related papers (2020-04-30T17:48:42Z)
- Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning [29.181547214915238]
We show that an attacker can control the "meaning" of new and existing words by changing their locations in the embedding space.
An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios.
arXiv Detail & Related papers (2020-01-14T17:48:52Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)