Homonymy Information for English WordNet
- URL: http://arxiv.org/abs/2212.08388v1
- Date: Fri, 16 Dec 2022 10:23:26 GMT
- Title: Homonymy Information for English WordNet
- Authors: Rowan Hall Maudslay and Simone Teufel
- Abstract summary: We exploit recent advances in language modelling to synthesise homonymy annotation for Princeton WordNet.
We pair definitions based on their proximity in an embedding space produced by a Transformer model.
Despite the simplicity of this approach, our best model attains an F1 of .97 on an evaluation set that we annotate.
- Score: 9.860944032009847
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A widely acknowledged shortcoming of WordNet is that it lacks a distinction
between word meanings which are systematically related (polysemy), and those
which are coincidental (homonymy). Several previous works have attempted to
fill this gap, by inferring this information using computational methods. We
revisit this task, and exploit recent advances in language modelling to
synthesise homonymy annotation for Princeton WordNet. Previous approaches treat
the problem using clustering methods; by contrast, our method works by linking
WordNet to the Oxford English Dictionary, which contains the information we
need. To perform this alignment, we pair definitions based on their proximity
in an embedding space produced by a Transformer model. Despite the simplicity
of this approach, our best model attains an F1 of .97 on an evaluation set that
we annotate. The outcome of our work is a high-quality homonymy annotation
layer for Princeton WordNet, which we release.
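As a rough illustration of the alignment step described in the abstract, the sketch below pairs each WordNet gloss with its nearest dictionary definition by cosine similarity in a sentence-embedding space. The library and model (`sentence-transformers`, `all-MiniLM-L6-v2`) and the toy definitions are assumptions for the example, not the paper's actual setup.

```python
# Minimal sketch of embedding-based definition alignment (not the paper's
# exact pipeline): pair each WordNet gloss with the closest dictionary
# definition in a shared Transformer embedding space.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed generic encoder

wordnet_glosses = [
    "a financial institution that accepts deposits",
    "sloping land beside a body of water",
]
oed_definitions = [
    "the land alongside a river or lake",
    "an organisation offering financial services",
]

# Normalised embeddings make the dot product equal to cosine similarity.
wn_emb = model.encode(wordnet_glosses, normalize_embeddings=True)
oed_emb = model.encode(oed_definitions, normalize_embeddings=True)

sim = wn_emb @ oed_emb.T  # pairwise cosine similarities
for i, j in enumerate(sim.argmax(axis=1)):  # nearest OED definition per gloss
    print(f"{wordnet_glosses[i]!r} -> {oed_definitions[j]!r} ({sim[i, j]:.2f})")
```

Once each sense is linked to a dictionary definition in this way, a homonymy layer can be read off the dictionary's entry structure, since dictionaries such as the OED list coincidentally identical spellings under distinct entries.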
Related papers
- Homonym Sense Disambiguation in the Georgian Language [49.1574468325115]
This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawls corpus.
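A minimal sketch of what such supervised fine-tuning might look like, framed as sentence classification over sense labels with the Hugging Face `transformers` API; the base model (`xlm-roberta-base`), label scheme, and data are placeholders, not details from the paper.

```python
# Hypothetical fine-tuning sketch for sense disambiguation as sequence
# classification; model name and data are stand-ins, not the paper's setup.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # assumed base LM
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # one label per candidate sense

texts = ["... sentence containing the ambiguous word ...",
         "... another sentence with a different sense ..."]
labels = [0, 1]  # gold sense indices

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class SenseDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wsd-model", num_train_epochs=3),
    train_dataset=SenseDataset(),
)
trainer.train()
```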
arXiv Detail & Related papers (2024-04-24T21:48:43Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
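For intuition about the task itself, a toy dictionary-based decompounder is sketched below; the paper trains dedicated models, so this greedy vocabulary lookup is only an illustration of what decompounding produces.

```python
# Toy decompounder (illustrative only, unrelated to the paper's trained
# models): recursively split a word into known vocabulary items.
VOCAB = {"book", "shelf", "butter", "fly", "sun", "flower"}

def decompound(word, min_len=3):
    """Return constituent parts if `word` splits into vocabulary items."""
    if word in VOCAB:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in VOCAB:
            rest = decompound(tail, min_len)
            if rest:
                return [head] + rest
    return []  # no valid split found

print(decompound("bookshelf"))   # ['book', 'shelf']
print(decompound("sunflower"))   # ['sun', 'flower']
```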
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
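As a hedged illustration of one of these evaluations, the snippet below performs bitext mining by nearest-neighbour search over sentence embeddings; the random vectors are stand-ins for an encoder's output, and this shows the evaluation protocol rather than the paper's generative model.

```python
# Illustrative bitext mining: match each source sentence to its translation
# by cosine nearest neighbour. Embeddings here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 32))  # embeddings of 5 source-language sentences
tgt = src[[2, 0, 4, 1, 3]] + 0.01 * rng.normal(size=(5, 32))  # shuffled translations

# Normalise so dot products are cosine similarities.
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

matches = (src @ tgt.T).argmax(axis=1)  # best target for each source sentence
print(matches)  # recovers the permutation [1, 3, 0, 4, 2]
```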
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
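For a concrete sense of what a hypernym class is, the sketch below maps words to coarse classes via WordNet hypernyms using NLTK; it illustrates the notion of a class, not the paper's training procedure.

```python
# Sketch of deriving hypernym classes from WordNet with NLTK.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def hypernym_class(word):
    """Map a word to a coarse class via its first sense's first hypernym."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    hypers = synsets[0].hypernyms()
    return hypers[0].name() if hypers else synsets[0].name()

for w in ["dog", "cat", "car"]:
    print(w, "->", hypernym_class(w))
# e.g. dog -> canine.n.02, cat -> feline.n.01, car -> motor_vehicle.n.01
```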
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Denoising Word Embeddings by Averaging in a Shared Space [34.175826109538676]
We introduce a new approach for smoothing and improving the quality of word embeddings.
We project all the models to a shared vector space using an efficient implementation of the Generalized Procrustes Analysis (GPA) procedure.
As the new representations are more stable and reliable, there is a noticeable improvement in rare word evaluations.
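A minimal sketch of the underlying idea: align several embedding spaces onto a reference with orthogonal Procrustes rotations, then average. The paper's GPA implementation is more general than this toy version, which uses exact rotations of a single matrix to make the recovery verifiable.

```python
# Align rotated copies of an embedding matrix back onto a reference space
# and average them (in the spirit of Generalized Procrustes Analysis).
import numpy as np

def orthogonal_align(X, ref):
    """Best rotation R (via SVD) mapping X onto ref in the least-squares sense."""
    U, _, Vt = np.linalg.svd(X.T @ ref)
    return U @ Vt

rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 50))        # embeddings from model 1
models = [ref @ Q for Q in              # same vectors, rotated: models 2 and 3
          (np.linalg.qr(rng.normal(size=(50, 50)))[0] for _ in range(2))]

aligned = [ref] + [X @ orthogonal_align(X, ref) for X in models]
averaged = np.mean(aligned, axis=0)     # denoised shared-space embeddings
print(np.allclose(averaged, ref, atol=1e-6))  # True: the rotations are undone
```

With real (noisy, independently trained) models the aligned copies differ, and averaging smooths out model-specific noise, which is the stability effect the summary describes.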
arXiv Detail & Related papers (2021-06-05T19:49:02Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
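The sketch below illustrates the general idea of a compositional output layer: a word's output vector is built from its characters, so the layer's parameter count does not grow with the vocabulary. The architecture shown is a stand-in, not the paper's model.

```python
# Toy compositional output embedding: any word, even one unseen in training,
# gets an output vector without a |V| x dim output matrix.
import torch
import torch.nn as nn

class CharComposedEmbedding(nn.Module):
    def __init__(self, dim=64, n_chars=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)   # fixed-size char table
        self.proj = nn.Linear(dim, dim)

    def forward(self, word: str) -> torch.Tensor:
        ids = torch.tensor([min(ord(c), 127) for c in word])
        pooled = self.char_emb(ids).mean(dim=0)      # compose chars -> word
        return self.proj(pooled)

emb = CharComposedEmbedding()
hidden = torch.randn(64)  # stand-in for an LM hidden state
for w in ["cat", "cats", "zyzzyva"]:
    # Score each candidate word against the hidden state on the fly.
    print(w, torch.dot(hidden, emb(w)).item())
```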
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Combining Neural Language Models for Word Sense Induction [0.5199765487172326]
Word sense induction (WSI) is the problem of grouping occurrences of an ambiguous word according to the expressed sense of this word.
Recently a new approach to this task was proposed, which generates possible substitutes for the ambiguous word in a particular context.
In this work, we apply this approach to the Russian language and improve it in two ways.
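A rough sketch of the substitute-based approach: generate masked-LM substitutes for the ambiguous word in each context, then cluster occurrences by their substitute profiles. The model (`bert-base-uncased`), example contexts, and clustering choices are assumptions, not the paper's exact configuration.

```python
# Substitute-based word sense induction: represent each occurrence of an
# ambiguous word ("bank") by its top masked-LM substitutes, then cluster.
from transformers import pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

fill = pipeline("fill-mask", model="bert-base-uncased")

contexts = [
    "He deposited the money in the [MASK].",
    "She opened an account at the [MASK].",
    "They had a picnic on the river [MASK].",
]

# Represent each occurrence by a bag of its top substitute words.
subs = [" ".join(p["token_str"] for p in fill(c, top_k=10)) for c in contexts]
X = CountVectorizer().fit_transform(subs).toarray()

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # occurrences with the same label share an induced sense
```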
arXiv Detail & Related papers (2020-06-23T17:57:25Z)
- Lexical Sememe Prediction using Dictionary Definitions by Capturing Local Semantic Correspondence [94.79912471702782]
Sememes, defined as the minimum semantic units of human languages, have been proven useful in many NLP tasks.
We propose a Sememe Correspondence Pooling (SCorP) model, which is able to capture this kind of matching to predict sememes.
We evaluate our model and baseline methods on a famous sememe KB HowNet and find that our model achieves state-of-the-art performance.
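To make the matching idea concrete, the toy sketch below scores a candidate sememe by max-pooling its similarity against each word of a definition. The vectors here are random stand-ins; the actual SCorP model learns this correspondence end to end.

```python
# Loose sketch of correspondence pooling: a sememe's score for a definition
# is its best match against any single definition word.
import numpy as np

rng = np.random.default_rng(2)
def_words = ["domesticated", "animal", "kept", "as", "pet"]
sememes = ["animal", "human", "machine"]
vecs = {w: rng.normal(size=16) for w in def_words + sememes}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sememe_score(definition, sememe):
    """Max-pool the sememe's similarity over all definition words."""
    return max(cos(vecs[w], vecs[sememe]) for w in definition)

for s in sememes:
    print(s, round(sememe_score(def_words, s), 3))
# 'animal' wins: it matches the definition word 'animal' exactly (cos = 1.0)
```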
arXiv Detail & Related papers (2020-01-16T17:30:36Z)