ELMo and BERT in semantic change detection for Russian
- URL: http://arxiv.org/abs/2010.03481v1
- Date: Wed, 7 Oct 2020 15:34:00 GMT
- Title: ELMo and BERT in semantic change detection for Russian
- Authors: Julia Rodina, Yuliya Trofimova, Andrey Kutuzov, Ekaterina Artemova
- Abstract summary: We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data.
Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods.
- Score: 4.389735175149927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the effectiveness of contextualized embeddings for the task of
diachronic semantic change detection for Russian language data. Evaluation test
sets consist of Russian nouns and adjectives annotated based on their
occurrences in texts created in pre-Soviet, Soviet and post-Soviet time
periods. ELMo and BERT architectures are compared on the task of ranking
Russian words according to the degree of their semantic change over time. We
use several methods for aggregation of contextualized embeddings from these
architectures and evaluate their performance. Finally, we compare unsupervised
and supervised techniques in this task.
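One way to picture the ranking task described in the abstract is the sketch below. It assumes a simple averaging ("prototype") aggregation, which the abstract does not name: each word's contextualized embeddings are averaged per time period, and the cosine distance between the two averages is used as the change score. The function names and the toy two-dimensional vectors are hypothetical; real inputs would be ELMo or BERT token embeddings extracted from the period corpora.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def change_score(embs_old, embs_new):
    """Assumed aggregation: distance between the per-period average embeddings."""
    return cosine_distance(mean_vector(embs_old), mean_vector(embs_new))


def rank_by_change(usages):
    """Rank words from most to least changed.

    `usages` maps a word to a pair (embeddings in the earlier period,
    embeddings in the later period).
    """
    scores = {word: change_score(old, new) for word, (old, new) in usages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): one word keeps its contexts, the other moves.
toy_usages = {
    "stable": ([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.1], [0.95, 0.0]]),
    "shifted": ([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]),
}
ranking = rank_by_change(toy_usages)
```

Averaging is only one possible aggregation; the same ranking function works with any other score that maps two sets of usage embeddings to a single number.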
Related papers
- The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design [39.80182519545138]
This paper focuses on research related to embedding models in the Russian language.
It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark.
arXiv Detail & Related papers (2024-08-22T15:53:23Z)
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR).
Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously.
Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z)
- Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models [53.95094814056337]
This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models.
The new version includes a number of technical, user experience and methodological improvements.
We provide the integration of Russian SuperGLUE with a framework for industrial evaluation of the open-source models, MOROCCO.
arXiv Detail & Related papers (2022-02-15T23:45:30Z)
- HistBERT: A Pre-trained Language Model for Diachronic Lexical Semantic Analysis [3.2851864672627618]
We present a pre-trained BERT-based language model, HistBERT, trained on the balanced Corpus of Historical American English.
We report promising results in word similarity and semantic shift analysis.
arXiv Detail & Related papers (2022-02-08T02:53:48Z)
- Three-part diachronic semantic change dataset for Russian [4.7566046630595755]
We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval.
Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods.
arXiv Detail & Related papers (2021-06-15T17:12:25Z)
- Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on the pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
- RuSemShift: a dataset of historical lexical semantic change in Russian [3.261599248682794]
We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian.
Target words were annotated by multiple crowd-source workers.
We report the performance of several distributional approaches on RuSemShift, achieving promising results.
arXiv Detail & Related papers (2020-10-13T14:54:05Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces have different degrees of strength for the structural and semantic properties.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Dataset for Automatic Summarization of Russian News [0.0]
We present Gazeta, the first dataset for summarization of Russian news.
We demonstrate that the dataset is a valid task for methods of text summarization for Russian.
arXiv Detail & Related papers (2020-06-19T10:44:06Z)