Detecting New Word Meanings: A Comparison of Word Embedding Models in
Spanish
- URL: http://arxiv.org/abs/2001.05285v1
- Date: Sun, 12 Jan 2020 21:54:52 GMT
- Title: Detecting New Word Meanings: A Comparison of Word Embedding Models in
Spanish
- Authors: Andrés Torres-Rivera and Juan-Manuel Torres-Moreno
- Abstract summary: Semantic neologisms (SN) are words that acquire a new word meaning while maintaining their form.
To detect SN in a semi-automatic way, we developed a system that combines topic modeling, keyword extraction, and word sense disambiguation.
We examine the following word embedding models: Word2Vec, Sense2Vec, and FastText.
- Score: 1.5356167668895644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic neologisms (SN) are defined as words that acquire a new word meaning
while maintaining their form. Given the nature of this kind of neologism, the
task of identifying these new word meanings is currently performed manually by
specialists at observatories of neology. To detect SN in a semi-automatic way,
we developed a system that implements a combination of the following
strategies: topic modeling, keyword extraction, and word sense disambiguation.
The role of topic modeling is to detect the themes that are treated in the
input text. Themes within a text give clues about the particular meaning of the
words that are used, for example: viral has one meaning in the context of
computer science (CS) and another when talking about health. To extract
keywords, we used TextRank with POS tag filtering. With this method, we can
obtain relevant words that are already part of the Spanish lexicon. We use a
deep learning model to determine if a given keyword could have a new meaning.
Embeddings that are different from all the known meanings (or topics) indicate
that a word might be a valid SN candidate. In this study, we examine the
following word embedding models: Word2Vec, Sense2Vec, and FastText. The models
were trained with equivalent parameters using the Spanish Wikipedia as the training corpus.
Then we used a list of words and their concordances (obtained from our database
of neologisms) to show the different embeddings that each model yields.
Finally, we present a comparison of these outcomes with the concordances of
each word to show how we can determine if a word could be a valid candidate for
SN.
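As a rough illustration of the keyword-extraction step, TextRank with POS-tag filtering can be sketched as PageRank over a word co-occurrence graph. The spaCy model name, the POS filter, and the window size below are assumptions; the paper does not list its exact configuration:

```python
# Illustrative sketch of TextRank keyword extraction with POS-tag
# filtering (not the authors' exact implementation). Assumes the spaCy
# Spanish model "es_core_news_sm" is installed.
import networkx as nx
import spacy

nlp = spacy.load("es_core_news_sm")   # assumed Spanish pipeline
KEEP_POS = {"NOUN", "PROPN", "ADJ"}   # assumed POS filter
WINDOW = 4                            # assumed co-occurrence window

def textrank_keywords(text, top_k=10):
    doc = nlp(text)
    # Keep only candidate words that pass the POS filter.
    words = [t.lemma_.lower() for t in doc
             if t.pos_ in KEEP_POS and not t.is_stop]
    graph = nx.Graph()
    # Link words that co-occur within a sliding window.
    for i, w in enumerate(words):
        for v in words[i + 1:i + WINDOW]:
            if w != v:
                graph.add_edge(w, v)
    scores = nx.pagerank(graph)  # TextRank = PageRank on the word graph
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```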
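Training the three models "with equivalent parameters" can be approximated with gensim. Since Sense2Vec operates on POS-disambiguated tokens (e.g. "viral|ADJ"), it is emulated here by retraining Word2Vec on a retagged corpus; all hyperparameter values are illustrative assumptions, not the paper's settings:

```python
# Minimal sketch of training the three embedding models with shared
# hyperparameters on a tokenized Spanish Wikipedia dump (gensim 4.x API).
from gensim.models import Word2Vec, FastText

shared = dict(vector_size=300, window=5, min_count=5,
              workers=4, sg=1, epochs=5)  # assumed shared settings

def train_models(sentences, tagged_sentences):
    w2v = Word2Vec(sentences=sentences, **shared)
    ft = FastText(sentences=sentences, **shared)
    # Sense2Vec-style model: plain Word2Vec over "word|POS" tokens.
    s2v = Word2Vec(sentences=tagged_sentences, **shared)
    return w2v, ft, s2v
```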
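The candidate test itself can then be sketched as a similarity check: a concordance whose context embedding is far from every known sense of the word suggests a new meaning. The composition function (averaging the context vectors) and the threshold are assumptions made for illustration:

```python
# Hedged sketch of the SN-candidate test: flag a keyword when its
# embedding in a new concordance is distant from all known senses.
import numpy as np

def context_vector(model, tokens):
    """Average the embeddings of the context words (simple composition)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

def is_sn_candidate(model, concordance_tokens, known_sense_vectors,
                    threshold=0.4):
    ctx = context_vector(model, concordance_tokens)
    if ctx is None:
        return False
    sims = [np.dot(ctx, s) / (np.linalg.norm(ctx) * np.linalg.norm(s))
            for s in known_sense_vectors]
    # A new meaning is suspected when no known sense is close enough.
    return max(sims) < threshold
```

Averaging is only one possible composition function; the threshold would in practice be tuned against the manually validated neologisms in the authors' database.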
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging the semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
- Review of Unsupervised POS Tagging and Its Implications on Language Acquisition [0.0]
An ability that underlies human syntactic knowledge is determining which words can appear in similar structures.
In exploring this process, we will review various engineering approaches whose goal is similar to that of a child.
We will discuss common themes that support the advances in the models and their relevance for language acquisition.
arXiv Detail & Related papers (2023-12-15T19:31:00Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests that LMs may serve as useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- IRB-NLP at SemEval-2022 Task 1: Exploring the Relationship Between Words and Their Semantic Representations [0.0]
We present our findings based on the descriptive, exploratory, and predictive data analysis conducted on the CODWOE dataset.
We give a detailed overview of the systems that we designed for Definition Modeling and Reverse Dictionary tasks.
arXiv Detail & Related papers (2022-05-13T18:15:20Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Morphological Skip-Gram: Using morphological knowledge to improve word representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on the well-known sememe knowledge base HowNet and find that our model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-01-16T17:30:36Z)
- Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning [29.181547214915238]
We show that an attacker can control the "meaning" of new and existing words by changing their locations in the embedding space.
An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios.
arXiv Detail & Related papers (2020-01-14T17:48:52Z)