An Algorithm for Fuzzification of WordNets, Supported by a Mathematical
Proof
- URL: http://arxiv.org/abs/2006.04042v1
- Date: Sun, 7 Jun 2020 04:47:40 GMT
- Title: An Algorithm for Fuzzification of WordNets, Supported by a Mathematical
Proof
- Authors: Sayyed-Ali Hossayni, Mohammad-R Akbarzadeh-T, Diego Reforgiato
Recupero, Aldo Gangemi, Esteve Del Acebo, Josep Lluís de la Rosa i Esteva
- Abstract summary: We present an algorithm for constructing fuzzy versions of WLDs of any language.
We publish online the fuzzified version of English WordNet (FWN)
- Score: 3.684688928766659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: WordNet-like Lexical Databases (WLDs) group English words into sets of
synonyms called "synsets." Although the standard WLDs are used in many
successful text-mining applications, they have a limitation: every word-sense
is assumed to represent the meaning of its corresponding synset to the same
degree, which is not generally true. To overcome this limitation, several
fuzzy versions of synsets have been proposed. A common trait of these studies
is that, to the best of our knowledge, they do not aim to produce fuzzified
versions of the existing WLDs, but instead build new WLDs from scratch, which
has limited the attention they have received from the text-mining community,
many of whose resources and applications are based on the existing WLDs. In
this study, we present an algorithm for constructing fuzzy versions of WLDs of
any language, given a corpus of documents and a word-sense disambiguation
(WSD) system for that language. Then, using the Open American National Corpus
and the UKB WSD system as inputs, we construct and publish online the
fuzzified version of the English WordNet (FWN). We also provide a mathematical
proof of the validity of its results.
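The paper's own membership computation is defined in the full text; as a rough illustrative sketch of the general idea (running a WSD system over a corpus and turning sense-assignment counts into graded synset membership), one might write something like the following. The function name, input format, and max-normalization are illustrative assumptions, not the authors' method:

```python
from collections import defaultdict

def fuzzify_synsets(wsd_tagged_corpus):
    """Derive fuzzy synset membership degrees from WSD output.

    `wsd_tagged_corpus` is an iterable of (word, synset_id) pairs, one per
    word occurrence, where synset_id is the synset the WSD system chose
    for that occurrence. (Illustrative format, not the paper's.)
    """
    # Count how often each word is disambiguated to each synset.
    counts = defaultdict(lambda: defaultdict(int))
    for word, synset_id in wsd_tagged_corpus:
        counts[synset_id][word] += 1

    # Normalize within each synset so the most frequent member word
    # receives membership degree 1.0 and the rest are scaled relative
    # to it, yielding a fuzzy set per synset.
    fuzzy_synsets = {}
    for synset_id, word_counts in counts.items():
        top = max(word_counts.values())
        fuzzy_synsets[synset_id] = {w: c / top for w, c in word_counts.items()}
    return fuzzy_synsets
```

For example, if "car" is tagged with synset `s1` twice and "auto" once, the sketch assigns `car` degree 1.0 and `auto` degree 0.5 within `s1` — capturing the abstract's point that not all word-senses represent a synset's meaning to the same degree.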
Related papers
- Deep Emotions Across Languages: A Novel Approach for Sentiment
Propagation in Multilingual WordNets [4.532887563053358]
This paper introduces two new techniques for automatically propagating sentiment annotations from a partially annotated WordNet to its entirety and to a WordNet in a different language.
We evaluated the proposed MSSE+CLDNS method extensively using Princeton WordNet and Polish WordNet, which have many inter-lingual relations.
Our results show that the MSSE+CLDNS method outperforms existing propagation methods, indicating its effectiveness in enriching WordNets with emotional metadata across multiple languages.
arXiv Detail & Related papers (2023-12-07T21:44:14Z)
- The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two language heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences."
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Interval Probabilistic Fuzzy WordNet [8.396691008449704]
We present an algorithm for constructing the Interval Probabilistic Fuzzy (IPF) synsets in any language.
We constructed and published the IPF synsets of WordNet for the English language.
arXiv Detail & Related papers (2021-04-04T17:28:37Z)
- Deconstructing word embedding algorithms [17.797952730495453]
We propose a retrospective on some of the most well-known word embedding algorithms.
In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings.
arXiv Detail & Related papers (2020-11-12T14:23:35Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- A Comparative Study of Lexical Substitution Approaches based on Neural Language Models [117.96628873753123]
We present a large-scale comparative study of popular neural language and masked language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly.
arXiv Detail & Related papers (2020-05-29T18:43:22Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.