Computational Approaches for Integrating out Subjectivity in Cognate Synonym Selection
- URL: http://arxiv.org/abs/2404.19328v2
- Date: Wed, 5 Jun 2024 15:56:26 GMT
- Title: Computational Approaches for Integrating out Subjectivity in Cognate Synonym Selection
- Authors: Luise Häuser, Gerhard Jäger, Alexandros Stamatakis
- Abstract summary: In the early days of language phylogenetics it was recommended to select one synonym only.
We show that binary character matrices do allow for representing the entire dataset including all synonyms.
We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
- Score: 45.14832807541816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Working with cognate data involves handling synonyms, that is, multiple words that describe the same concept in a language. In the early days of language phylogenetics it was recommended to select one synonym only. However, as we show here, binary character matrices, which are used as input for computational methods, do allow for representing the entire dataset including all synonyms. Here we address the question of how one can, and whether one should, include all synonyms, or whether it is preferable to select synonyms a priori. To this end, we perform maximum likelihood tree inferences with the widely used RAxML-NG tool and show that it yields plausible trees when all synonyms are used as input. Furthermore, we show that a priori synonym selection can yield topologically substantially different trees, and we therefore advise against doing so. To represent cognate data including all synonyms, we introduce two types of character matrices beyond the standard binary ones: probabilistic binary and probabilistic multi-valued character matrices. We further show that the character matrix type for which the inferred RAxML-NG tree is topologically closest to the gold standard depends on the dataset. We also make available a Python interface for generating all of the above character matrix types for cognate data provided in CLDF format.
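As a minimal sketch of how a standard binary character matrix can hold all synonyms at once, the following Python builds one presence/absence column per (concept, cognate class) pair. It is not the authors' Python interface: the function name `binary_character_matrix`, the toy records, and the use of `-` as the missing-data symbol are assumptions made for illustration. The probabilistic matrix types are not shown, since the abstract does not define them.

```python
def binary_character_matrix(records):
    """Build a binary presence/absence matrix from cognate records.

    `records` is an iterable of (language, concept, cognate_class) triples.
    A language with several synonyms for a concept simply gets a 1 in
    several columns of that concept, so no synonym has to be discarded.
    """
    records = list(records)
    languages = sorted({lang for lang, _, _ in records})
    # One column per (concept, cognate class) pair.
    columns = sorted({(concept, cc) for _, concept, cc in records})
    present = set(records)
    # (language, concept) pairs for which at least one word is attested.
    covered = {(lang, concept) for lang, concept, _ in records}

    matrix = {}
    for lang in languages:
        row = []
        for concept, cc in columns:
            if (lang, concept) not in covered:
                row.append("-")  # assumed missing-data symbol
            elif (lang, concept, cc) in present:
                row.append("1")
            else:
                row.append("0")
        matrix[lang] = "".join(row)
    return columns, matrix


# Hypothetical toy data: language A has two synonyms for "big".
records = [
    ("A", "big", "c1"),
    ("A", "big", "c2"),
    ("B", "big", "c1"),
    ("C", "big", "c2"),
    ("A", "dog", "c3"),
    ("B", "dog", "c4"),
]

columns, matrix = binary_character_matrix(records)
for lang, row in matrix.items():
    print(lang, row)
```

In this toy data, language A contributes a 1 to both "big" columns, and language C, which has no word recorded for "dog", receives the missing-data symbol in both "dog" columns rather than a 0.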
Related papers
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- A Benchmark and Scoring Algorithm for Enriching Arabic Synonyms [0.0]
Given a mono/multilingual synset and a threshold (a fuzzy value [0-1]), our goal is to extract new synonyms above this threshold from existing lexicons.
The dataset consists of 3K candidate synonyms for 500 synsets.
Our evaluations show that the algorithm behaves like a linguist and its fuzzy values are close to those proposed by linguists.
arXiv Detail & Related papers (2023-02-04T20:30:32Z)
- Chinese Word Sense Embedding with SememeWSD and Synonym Set [17.37973450772783]
We propose SememeWSD Synonym (SWSDS) model to assign a different vector to every sense of polysemous words.
We obtain top 10 synonyms of the word sense from OpenHowNet and calculate the average vector of synonyms as the vector of the word sense.
In experiments, we evaluate the SWSDS model on semantic similarity calculation with Gensim's wmdistance method.
arXiv Detail & Related papers (2022-06-29T03:42:03Z)
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- SemGloVe: Semantic Co-occurrences for GloVe from BERT [55.420035541274444]
GloVe learns word embeddings by leveraging statistical information from word co-occurrence matrices.
We propose SemGloVe, which distills semantic co-occurrences from BERT into static GloVe word embeddings.
arXiv Detail & Related papers (2020-12-30T15:38:26Z)
- PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding [16.531103175919924]
We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words.
We propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embedding.
arXiv Detail & Related papers (2020-10-21T08:11:08Z)