RuDSI: graph-based word sense induction dataset for Russian
- URL: http://arxiv.org/abs/2209.13750v1
- Date: Wed, 28 Sep 2022 00:08:24 GMT
- Title: RuDSI: graph-based word sense induction dataset for Russian
- Authors: Anna Aksenova, Ekaterina Gavrishina, Elisey Rykov, Andrey Kutuzov
- Abstract summary: RuDSI is a new benchmark for word sense induction (WSI) in Russian.
It is completely data-driven, with no external word senses imposed on annotators.
- Score: 1.997704019887898
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present RuDSI, a new benchmark for word sense induction (WSI) in Russian.
The dataset was created using manual annotation and semi-automatic clustering
of Word Usage Graphs (WUGs). Unlike prior WSI datasets for Russian, RuDSI is
completely data-driven (based on texts from the Russian National Corpus), with no
external word senses imposed on annotators. Depending on the parameters of
graph clustering, different derivative datasets can be produced from raw
annotation. We report the performance that several baseline WSI methods obtain
on RuDSI and discuss possibilities for improving these scores.
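The abstract describes inducing senses by clustering Word Usage Graphs, where nodes are usages of a target word and weighted edges are annotator relatedness judgments, with different clustering parameters yielding different derivative datasets. A minimal illustrative sketch of that idea (not the actual RuDSI pipeline, which uses semi-automatic graph clustering; the threshold value and data here are hypothetical) is to keep only edges above a relatedness threshold and take connected components as senses:

```python
# Hypothetical sketch: sense induction from a Word Usage Graph (WUG).
# Nodes are usages of one target word; weighted edges are human
# relatedness judgments. Keeping only edges at or above a threshold and
# taking connected components yields sense clusters; varying the
# threshold produces different derivative datasets.

def cluster_wug(edges, num_nodes, threshold=2.5):
    """Group usages into senses: connected components of the subgraph
    that keeps only edges with relatedness >= threshold."""
    parent = list(range(num_nodes))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for u, v, weight in edges:
        if weight >= threshold:
            union(u, v)

    clusters = {}
    for node in range(num_nodes):
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())

# Toy example: four usages; usages 0-1 and 2-3 are judged closely related.
judgments = [(0, 1, 4.0), (1, 2, 1.0), (2, 3, 3.5)]
senses = cluster_wug(judgments, num_nodes=4, threshold=2.5)
# Two induced senses: [[0, 1], [2, 3]]
```

Raising the threshold splits clusters into finer senses; lowering it merges them, which is one way the "parameters of graph clustering" can produce different derivative datasets from the same raw annotation.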
Related papers
- The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design [39.80182519545138]
This paper focuses on research related to embedding models in the Russian language.
It introduces a new Russian-focused embedding model called ru-en-RoSBERTa and the ruMTEB benchmark.
arXiv Detail & Related papers (2024-08-22T15:53:23Z)
- Semantic Change Detection for the Romanian Language [0.5202524136984541]
We analyze different strategies to create static and contextual word embedding models on real-world datasets.
We first evaluate both word embedding models on an English dataset (SEMEVAL-CCOHA) and then on a Romanian dataset.
The experimental results show that, depending on the corpus, the most important factors to consider are the choice of model and the distance to calculate a score for detecting semantic change.
arXiv Detail & Related papers (2023-08-23T13:37:02Z)
- A big data approach towards sarcasm detection in Russian [0.0]
We present a set of deterministic algorithms for Russian inflection and automated text synthesis.
These algorithms are implemented in a publicly available web-service www.passare.ru.
arXiv Detail & Related papers (2023-06-01T08:34:26Z)
- Characterizing and Measuring Linguistic Dataset Drift [65.28821163863665]
We propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift.
These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies.
We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies.
arXiv Detail & Related papers (2023-05-26T17:50:51Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- FEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary [43.32179344258548]
Current models for Word Sense Disambiguation (WSD) struggle to disambiguate rare senses.
This paper introduces FEWS, a new low-shot WSD dataset automatically extracted from example sentences in Wiktionary.
arXiv Detail & Related papers (2021-02-16T07:13:34Z)
- Graph-to-Sequence Neural Machine Translation [79.0617920270817]
We propose a graph-based SAN-based NMT model called Graph-Transformer.
Subgraphs are grouped according to their orders, and each group of subgraphs reflects a different level of dependency between words.
Our method can effectively boost the Transformer with an improvement of 1.1 BLEU points on WMT14 English-German dataset and 1.0 BLEU points on IWSLT14 German-English dataset.
arXiv Detail & Related papers (2020-09-16T06:28:58Z)
- Dataset for Automatic Summarization of Russian News [0.0]
We present Gazeta, the first dataset for summarization of Russian news.
We demonstrate that the dataset provides a valid task for Russian text summarization methods.
arXiv Detail & Related papers (2020-06-19T10:44:06Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering the roles words play in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.