Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists
- URL: http://arxiv.org/abs/2503.00464v1
- Date: Sat, 01 Mar 2025 12:16:45 GMT
- Title: Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists
- Authors: David Snee, Luca Ciucci, Arne Rubehn, Kellen Parker van Dam, Johann-Mattis List
- Abstract summary: We investigate the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families. On average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.
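To make the measurement concrete, here is a minimal sketch (not the authors' code) of how one could compute the two agreement rates the abstract reports: the share of shared concepts for which two independently compiled wordlists agree on the orthographic word form, and the share with identical phonetic transcriptions. The data layout and the toy German entries are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): compare concept translations in two
# independently compiled wordlists. Each wordlist maps a concept label to a set
# of (orthographic form, phonetic transcription) pairs; this layout is assumed.

def agreement_rates(wordlist_a, wordlist_b):
    """Return (share of concepts with a common form, share with a common transcription)."""
    shared = set(wordlist_a) & set(wordlist_b)
    same_form = same_ipa = 0
    for concept in shared:
        forms_a = {form for form, _ in wordlist_a[concept]}
        forms_b = {form for form, _ in wordlist_b[concept]}
        ipas_a = {ipa for _, ipa in wordlist_a[concept]}
        ipas_b = {ipa for _, ipa in wordlist_b[concept]}
        same_form += bool(forms_a & forms_b)
        same_ipa += bool(ipas_a & ipas_b)
    return same_form / len(shared), same_ipa / len(shared)

# Toy example: two compilers translating the same three concepts into German.
a = {"TREE": {("Baum", "baʊm")}, "HAND": {("Hand", "hant")}, "STONE": {("Stein", "ʃtaɪn")}}
b = {"TREE": {("Baum", "bɑʊm")}, "HAND": {("Hand", "hant")}, "STONE": {("Fels", "fɛls")}}
form_rate, ipa_rate = agreement_rates(a, b)
print(f"identical forms: {form_rate:.0%}, identical transcriptions: {ipa_rate:.0%}")
```

The gap between the two toy rates (67% vs. 33%) mirrors the paper's observation that agreement at the level of phonetic transcriptions is much rarer than agreement on word forms.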
Related papers
- Lost in Translation -- Multilingual Misinformation and its Evolution [52.07628580627591]
This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of over 250,000 unique fact-checks spanning 95 languages.
We find that while the majority of misinformation claims are only fact-checked once, 11.7%, corresponding to more than 21,000 claims, are checked multiple times.
Using fact-checks as a proxy for the spread of misinformation, we find 33% of repeated claims cross linguistic boundaries.
arXiv Detail & Related papers (2023-10-27T12:21:55Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing interactions at the syntax-semantics interface.
This suggests that LMs may serve as useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Sentiment Classification of Code-Switched Text using Pre-trained Multilingual Embeddings and Segmentation [1.290382979353427]
We propose a multi-step natural language processing algorithm for code-switched sentiment analysis.
The proposed algorithm can be extended to sentiment analysis in multiple languages with limited human expertise.
arXiv Detail & Related papers (2022-10-29T01:52:25Z)
- Corpus Similarity Measures Remain Robust Across Diverse Languages [0.0]
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task.
The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora.
Results show that measures of corpus similarity retain their validity across different language families, writing systems, and types of morphology.
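As a concrete illustration of what a frequency-based corpus similarity measure can look like (a hedged sketch, not the paper's exact measure), the snippet below computes Spearman rank agreement over the most frequent words shared by two corpora; the toy corpora and whitespace tokenization are assumptions.

```python
# Minimal sketch, not the paper's exact measure: one classic frequency-based
# corpus similarity score is rank agreement (Spearman's rho) over the most
# frequent words shared by two corpora. Toy corpora and whitespace
# tokenization are assumptions; ties in frequency are broken arbitrarily.
from collections import Counter

def rank_map(tokens, vocab):
    """Rank each vocabulary word by its frequency within one corpus."""
    freq = Counter(tokens)
    ordered = sorted(vocab, key=lambda w: -freq[w])
    return {w: r for r, w in enumerate(ordered)}

def freq_similarity(tokens_a, tokens_b, top_n=50):
    """Spearman's rho over the top_n most frequent words of the pooled corpora."""
    common = [w for w, _ in (Counter(tokens_a) + Counter(tokens_b)).most_common(top_n)]
    ra, rb = rank_map(tokens_a, common), rank_map(tokens_b, common)
    n = len(common)
    d2 = sum((ra[w] - rb[w]) ** 2 for w in common)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

a = "the cat sat on the mat the cat purred".split()
b = "the dog sat on the mat the dog barked".split()
print(freq_similarity(a, b, top_n=5))  # ~0.4: similar register, different content
```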
arXiv Detail & Related papers (2022-06-09T08:17:16Z)
- LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser.
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on a word-level translation of available treebanks in other languages to obtain noisy, likely ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
arXiv Detail & Related papers (2022-04-27T10:21:28Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer [15.578267998149743]
We show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order.
There is a strong correlation between transfer performance and word embedding alignment between languages.
Our results call for a focus on explicitly improving word embedding alignment between languages in multilingual models.
arXiv Detail & Related papers (2021-10-27T21:25:39Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
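A sketch of what such a task-based alignment measure looks like in practice follows (not the paper's code; the toy random vectors stand in for real sentence embeddings): for each source embedding, retrieve the most cosine-similar target embedding and check whether it is the true translation.

```python
# Minimal sketch (not the paper's code): bitext retrieval accuracy as a
# task-based measure of cross-lingual alignment. src[i] and tgt[i] are assumed
# to embed a translation pair; toy vectors stand in for real embeddings.
import numpy as np

def bitext_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source rows whose nearest target row (by cosine) is the true translation."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)          # index of most similar target
    return float((nearest == np.arange(len(src))).mean())

rng = np.random.default_rng(1)
meaning = rng.normal(size=(20, 32))                 # shared semantic component
src = meaning + 0.3 * rng.normal(size=(20, 32))     # language-specific noise
tgt = meaning + 0.3 * rng.normal(size=(20, 32))
print(bitext_retrieval_accuracy(src, tgt))          # near 1.0 for a well-aligned space
```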
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Comparative Analysis of Word Embeddings for Capturing Word Similarities [0.0]
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks.
Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings.
Selecting the appropriate word embeddings is, however, a perplexing task, since the projected embedding space is not intuitive to humans.
arXiv Detail & Related papers (2020-05-08T01:16:03Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
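The language-neutrality finding in the last entry can be illustrated with a toy experiment. The sketch below is not the paper's code: the explicit language-offset construction is an assumption, used to show how removing a per-language mean recovers the similarity of translation pairs.

```python
# Minimal sketch, not the paper's code: a toy model of language-specific
# components in multilingual embeddings. Each vector is a shared "meaning"
# part plus a language offset (an assumed construction); removing the
# per-language mean makes translation pairs similar again.
import numpy as np

def mean_cos(a, b):
    """Mean cosine similarity between aligned rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def center(x):
    """Remove the language-specific mean vector."""
    return x - x.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
meaning = rng.normal(size=(50, 16))                     # shared semantic component
offset = 2.0 * np.ones(16)                              # language identity signal
en = meaning + 0.1 * rng.normal(size=(50, 16)) + offset
de = meaning + 0.1 * rng.normal(size=(50, 16)) - offset

print("translation pairs, raw:     ", round(mean_cos(en, de), 2))                   # strongly negative
print("translation pairs, centered:", round(mean_cos(center(en), center(de)), 2))   # near 1.0
```

Real contextual embeddings are messier than this construction, but it conveys the intuition behind probing language neutrality through language identification and word alignment tasks.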