Semantic Change Detection for the Romanian Language
- URL: http://arxiv.org/abs/2308.12131v1
- Date: Wed, 23 Aug 2023 13:37:02 GMT
- Title: Semantic Change Detection for the Romanian Language
- Authors: Ciprian-Octavian Truică, Victor Tudose and Elena-Simona Apostol
- Abstract summary: We analyze different strategies to create static and contextual word embedding models on real-world datasets.
We first evaluate both word embedding models on an English dataset (SEMEVAL-CCOHA) and then on a Romanian dataset.
The experimental results show that, depending on the corpus, the most important factors to consider are the choice of model and the distance measure used to score semantic change.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic semantic change methods try to identify the changes that appear
over time in the meaning of words by analyzing their usage in diachronic
corpora. In this paper, we analyze different strategies to create static and
contextual word embedding models, i.e., Word2Vec and ELMo, on real-world
English and Romanian datasets. To test our pipeline and determine the
performance of our models, we first evaluate both word embedding models on an
English dataset (SEMEVAL-CCOHA). Afterward, we focus our experiments on a
Romanian dataset, and we underline different aspects of semantic changes in
this low-resource language, such as meaning acquisition and loss. The
experimental results show that, depending on the corpus, the most important
factors to consider are the choice of model and the distance measure used to
score semantic change.
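The pipeline for static embeddings can be made concrete with a short sketch: train one Word2Vec model per time period, align the two vector spaces with an orthogonal Procrustes rotation, and rank words by the cosine distance between their aligned vectors. The sketch below is a minimal illustration using gensim; the corpus filenames (corpus_t1.txt, corpus_t2.txt) and hyperparameters are placeholders, not the authors' exact configuration.
```python
# Minimal sketch of a static-embedding semantic change pipeline: train one
# Word2Vec model per period, align the spaces, rank words by cosine distance.
# Filenames and hyperparameters are illustrative, not the paper's setup.
import numpy as np
from gensim.models import Word2Vec

def read_sentences(path):
    """Yield whitespace-tokenized sentences from a plain-text corpus."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().lower().split()
            if tokens:
                yield tokens

m1 = Word2Vec(list(read_sentences("corpus_t1.txt")), vector_size=100, min_count=5)
m2 = Word2Vec(list(read_sentences("corpus_t2.txt")), vector_size=100, min_count=5)

# Words present in both periods are the candidates for change detection.
shared = sorted(set(m1.wv.index_to_key) & set(m2.wv.index_to_key))
X = np.stack([m1.wv[w] for w in shared])  # period-1 vectors
Y = np.stack([m2.wv[w] for w in shared])  # period-2 vectors

# Orthogonal Procrustes: rotate the period-1 space onto the period-2 space
# so that vectors from independently trained models become comparable.
U, _, Vt = np.linalg.svd(X.T @ Y)
X_aligned = X @ (U @ Vt)

# Cosine distance per word serves as the change score (higher = more change).
cos = (X_aligned * Y).sum(axis=1) / (
    np.linalg.norm(X_aligned, axis=1) * np.linalg.norm(Y, axis=1))
for word, score in sorted(zip(shared, 1.0 - cos), key=lambda p: -p[1])[:10]:
    print(f"{word}\t{score:.3f}")
```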
Related papers
- Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags.
We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels.
arXiv Detail & Related papers (2024-10-03T16:58:21Z)
- Linguistic Fingerprint in Transformer Models: How Language Variation Influences Parameter Selection in Irony Detection
We aim to investigate how different English variations impact transformer-based models for irony detection.
Our results reveal several similarities between optimal subnetworks, which provide insights into the linguistic variations that share strong resemblances and those that exhibit greater dissimilarities.
This study highlights the inherent structural similarities between models trained on different variants of the same language and also the critical role of parameter values in capturing these nuances.
arXiv Detail & Related papers (2024-06-04T14:09:36Z)
- Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
This paper examines the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models.
Our study finds that vocabulary size has a limited impact on model performance when model size is held constant.
The paper's recommendations include refining tokenization strategies to address dialect challenges, enhancing model robustness across diverse linguistic contexts, and expanding datasets to cover dialect-rich Arabic.
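One way to make such a comparison concrete is to train byte-pair-encoding tokenizers with different vocabulary sizes and inspect how they segment the same text (e.g., fertility, the number of subword tokens per word). The sketch below uses the Hugging Face tokenizers library with a placeholder corpus file and sample sentence; it illustrates the kind of experiment involved, not the paper's actual setup.
```python
# Sketch: compare how BPE vocabulary size affects segmentation of Arabic text.
# The corpus file and sample sentence are placeholders, not the paper's data.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(files, vocab_size):
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train(files, trainers.BpeTrainer(vocab_size=vocab_size,
                                         special_tokens=["[UNK]"]))
    return tok

sample = "اللغة العربية غنية باللهجات"  # "The Arabic language is rich in dialects"
for size in (8_000, 32_000):
    tok = train_bpe(["arabic_corpus.txt"], vocab_size=size)
    pieces = tok.encode(sample).tokens
    # Fertility = subword tokens per whitespace word; lower values mean the
    # vocabulary covers the text with fewer splits.
    print(size, pieces, len(pieces) / len(sample.split()))
```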
arXiv Detail & Related papers (2024-03-17T07:44:44Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
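The self-supervised idea of simulating semantic change can be sketched generically: inject a synthetic shift by letting a target word take over a donor word's contexts in the later corpus, producing pseudo-labeled changed words against which a detector can be calibrated. The word pair and replacement rate below are arbitrary choices for illustration; this is not the paper's exact perturbation procedure.
```python
# Sketch of a synthetic semantic shift: replace a fraction of a donor word's
# occurrences with the target word in the later-period corpus, so the target
# appears in the donor's contexts (a simulated new sense with a known label).
import random

def inject_shift(sentences, target, donor, rate=0.5, seed=0):
    """Return a copy of `sentences` where `rate` of `donor` occurrences
    are rewritten as `target`."""
    rng = random.Random(seed)
    return [[target if tok == donor and rng.random() < rate else tok
             for tok in sent]
            for sent in sentences]

corpus_t2 = [["the", "mouse", "ran", "away"],
             ["click", "the", "mouse", "button"]]
shifted = inject_shift(corpus_t2, target="rat", donor="mouse", rate=1.0)
print(shifted)  # "rat" now occurs in "mouse" contexts: a known, injected shift
```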
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word-frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
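A minimal sketch of such a voting ensemble, with hypothetical per-word signals standing in for the distributional and word-frequency features, could look like this (the feature values and decision threshold are invented for illustration):
```python
# Sketch of a voting ensemble: each feature model scores every word, scores
# are normalized into soft votes, and the averaged vote is thresholded.
import numpy as np

def soft_votes(scores):
    """Min-max normalize raw per-word scores into [0, 1] votes."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

words = ["plane", "bank", "cat", "tree"]
embedding_dist = [0.71, 0.55, 0.12, 0.20]  # hypothetical distributional signal
freq_change = [0.40, 0.62, 0.05, 0.10]     # hypothetical frequency signal

votes = np.mean([soft_votes(embedding_dist), soft_votes(freq_change)], axis=0)
print({w: bool(v > 0.5) for w, v in zip(words, votes)})
```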
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition
We present a general methodology for interpretable evaluation of the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- NLP-CIC @ DIACR-Ita: POS and Neighbor Based Distributional Models for Lexical Semantic Change in Diachronic Italian Corpora
We present our systems and findings on unsupervised lexical semantic change for the Italian language.
The task is to determine whether a target word has evolved its meaning over time, relying only on raw text from two time-specific datasets.
We propose two models that represent the target words across the two periods and predict changing words using threshold and voting schemes.
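A neighbor-based score of this kind can be sketched as the overlap between a word's nearest neighbors in each period, flagged against a threshold. In the sketch below, m1 and m2 are assumed to be gensim Word2Vec models already trained on the two time-specific datasets; the threshold and k are illustrative, not the submitted system's values.
```python
# Sketch of a neighbor-based change score: if a word's top-k neighborhoods in
# the two periods barely overlap, the word likely changed meaning. Comparing
# neighbor sets needs no alignment between the two vector spaces.
def neighbor_overlap(m1, m2, word, k=20):
    n1 = {w for w, _ in m1.wv.most_similar(word, topn=k)}
    n2 = {w for w, _ in m2.wv.most_similar(word, topn=k)}
    return len(n1 & n2) / len(n1 | n2)  # Jaccard similarity in [0, 1]

def has_changed(m1, m2, word, threshold=0.1):
    """Threshold scheme: flag the word when its neighborhoods barely overlap."""
    return neighbor_overlap(m1, m2, word) < threshold
```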
arXiv Detail & Related papers (2020-11-07T11:27:18Z)
- XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
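A simple WiC-style baseline can be sketched by embedding the target word in each context with a multilingual encoder and thresholding the cosine similarity of the two contextual vectors. The model name and threshold below are illustrative; this is not the benchmark's reference system.
```python
# Sketch of a WiC-style baseline: contextual vectors for the target word in
# two sentences, compared with cosine similarity against a tuned threshold.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def word_vector(sentence, word):
    """Mean hidden state of the subword pieces belonging to `word`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    piece_ids = tok(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(piece_ids) + 1):  # locate the word's pieces
        if ids[i:i + len(piece_ids)] == piece_ids:
            return hidden[i:i + len(piece_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {sentence!r}")

v1 = word_vector("He sat on the bank of the river.", "bank")
v2 = word_vector("She deposited the cash at the bank.", "bank")
same_sense = torch.cosine_similarity(v1, v2, dim=0) > 0.6  # threshold: dev-tuned
print(bool(same_sense))
```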
arXiv Detail & Related papers (2020-10-13T15:32:00Z)
- Grounded Compositional Outputs for Adaptive Language Modeling
A language model's vocabulary, typically selected before training and permanently fixed afterward, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
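The compositional idea can be sketched as an output layer that composes a word's embedding from its characters on the fly, so no vocabulary-sized output matrix is ever materialized and model size does not depend on the training vocabulary. The dimensions and GRU composition below are illustrative choices, not the paper's architecture.
```python
# Sketch of a compositional output layer: a word's output embedding is built
# from character embeddings by a small encoder, so any string (seen or not)
# can be scored and no |V| x d output matrix is stored.
import torch
import torch.nn as nn

class CharCompositionalOutput(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, word_dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.GRU(char_dim, word_dim, batch_first=True)

    def embed_word(self, word):
        """Compose an output embedding for an arbitrary word."""
        ids = torch.tensor([[min(ord(c), 255) for c in word]])
        _, h = self.encoder(self.char_emb(ids))
        return h[0, 0]  # final GRU state as the word's embedding

    def logits(self, hidden_state, candidate_words):
        """Score a decoder hidden state against composed word embeddings."""
        E = torch.stack([self.embed_word(w) for w in candidate_words])
        return E @ hidden_state

layer = CharCompositionalOutput()
h = torch.randn(128)
print(layer.logits(h, ["cat", "cats", "catamaran"]))  # works for any string
```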
arXiv Detail & Related papers (2020-09-24T07:21:14Z)