The Cognate Data Bottleneck in Language Phylogenetics
- URL: http://arxiv.org/abs/2507.00911v1
- Date: Tue, 01 Jul 2025 16:14:20 GMT
- Title: The Cognate Data Bottleneck in Language Phylogenetics
- Authors: Luise Häuser, Alexandros Stamatakis
- Abstract summary: Phylogenetic data analysis approaches that require larger datasets cannot be applied to cognate data. It remains an open question how, and if, these computational approaches can be applied in historical linguistics.
- Score: 49.1574468325115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To fully exploit the potential of computational phylogenetic methods for cognate data, one needs to leverage specific (complex) models and machine-learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it unlikely that more suitable character matrices can be extracted from other multilingual resources. Phylogenetic data analysis approaches that require larger datasets can therefore not be applied to cognate data. Thus, it remains an open question how, and if, these computational approaches can be applied in historical linguistics.
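For illustration, here is a minimal Python sketch of the pipeline the abstract describes: cognate judgements are binarised into a presence/absence character matrix, from which a tree is inferred. The cognate classes below are invented toy data, and the naive UPGMA clustering merely stands in for the dedicated phylogenetic inference tools used in the paper.

```python
from itertools import combinations

# Hypothetical cognate judgements: concept -> {language: cognate class id}.
# Real studies use expert-annotated wordlists; these values are invented.
cognates = {
    "water": {"de": 1, "en": 1, "nl": 1, "fr": 2},
    "dog":   {"de": 1, "en": 2, "nl": 1, "fr": 3},
    "hand":  {"de": 1, "en": 1, "nl": 1, "fr": 2},
}
languages = ["de", "en", "nl", "fr"]

# Binarise: one presence/absence column per (concept, cognate class) pair.
matrix = {lang: [] for lang in languages}
for concept, classes in cognates.items():
    for cls in sorted(set(classes.values())):
        for lang in languages:
            matrix[lang].append(1 if classes.get(lang) == cls else 0)

# Pairwise normalised Hamming distances between the character rows.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

dist = {frozenset(pair): hamming(matrix[pair[0]], matrix[pair[1]])
        for pair in combinations(languages, 2)}

def leaves(tree):
    """Collect leaf labels of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def cluster_dist(a, b):
    """Average leaf-to-leaf distance between two clusters (UPGMA)."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

# Naive UPGMA: repeatedly merge the two closest clusters.
clusters = list(languages)
while len(clusters) > 1:
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
    merged = (clusters[i], clusters[j])
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(clusters[0])  # ('fr', ('en', ('de', 'nl'))) on this toy data
```

The paper's point is that when the character matrix comes from an automatically extracted resource such as BabelNet rather than expert cognate judgements, trees built this way diverge from the gold standard.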
Related papers
- Beyond cognacy [0.21756081703275998]
Two fully automated methods are compared to extract phylogenetic signal directly from lexical data. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal.
arXiv Detail & Related papers (2025-07-02T06:47:34Z)
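To illustrate the general idea of extracting phylogenetic signal directly from word forms without cognate judgements, here is a minimal sketch using normalised Levenshtein distances averaged over concepts. The word forms are invented stand-ins, and the paper's actual MSA-based method is considerably more sophisticated.

```python
# Toy sketch: a cognacy-free distance between languages from raw word forms.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def norm_dist(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

# concept -> {language: word form}; transcriptions are invented.
wordlist = {
    "water": {"de": "vasser", "en": "wooter", "fr": "o"},
    "dog":   {"de": "hunt",   "en": "dog",    "fr": "shien"},
}

def language_distance(l1, l2):
    """Average normalised edit distance across shared concepts."""
    pairs = [(forms[l1], forms[l2]) for forms in wordlist.values()
             if l1 in forms and l2 in forms]
    return sum(norm_dist(a, b) for a, b in pairs) / len(pairs)

print(language_distance("de", "en"))  # smaller than de-fr on this toy data
```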
- Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
arXiv Detail & Related papers (2025-05-28T11:06:54Z)
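The distillation idea summarized above can be sketched in a few lines: fit a lightweight classifier on document embeddings against LLM-assigned quality labels. This is not the JQL implementation; the embeddings and labels below are random stand-ins.

```python
# Sketch of distilling LLM quality judgements into a lightweight annotator.
# In reality the embeddings would come from a pretrained multilingual
# encoder and the labels from LLM annotations of document quality.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))                              # stand-in embeddings
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)   # stand-in LLM labels

head = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("held-out accuracy:", head.score(X[800:], y[800:]))

# The fitted head is cheap enough to score billions of documents,
# which is the point of distilling the LLM's judgement.
```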
- From Isolates to Families: Using Neural Networks for Automated Language Affiliation [9.182884165239996]
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation. We present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families.
arXiv Detail & Related papers (2025-02-17T11:25:32Z)
- Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language [7.59001382786429]
This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German.
Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset.
We train several state-of-the-art machine learning models on the automatically created dataset and release them as well.
arXiv Detail & Related papers (2024-03-25T19:40:26Z)
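The core distant-supervision step behind such dataset creation can be illustrated compactly: knowledge-base triples are projected onto raw sentences that mention both entities, yielding noisily labelled training instances. The triples and sentences below are invented examples.

```python
# Minimal distant-supervision sketch: label a sentence with a relation
# whenever it mentions both entities of a known knowledge-base triple.
triples = [
    ("Johann Schmidt", "born_in", "Leipzig"),
    ("Johann Schmidt", "studied_at", "Heidelberg"),
]
sentences = [
    "Johann Schmidt was born in Leipzig in 1821.",
    "After school, Johann Schmidt enrolled in Heidelberg.",
    "Leipzig hosted the fair that year.",
]

dataset = []
for subj, rel, obj in triples:
    for sent in sentences:
        if subj in sent and obj in sent:
            dataset.append((sent, subj, rel, obj))

for example in dataset:
    print(example)
# Raw matching is noisy by construction; "guided" variants add
# filtering heuristics on top of this step.
```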
- Multilingual Gradient Word-Order Typology from Universal Dependencies [2.968112652976397]
Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format.
We introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language.
arXiv Detail & Related papers (2024-02-02T15:54:19Z)
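A continuous word-order value of the kind such a dataset encodes can be derived directly from dependency-parsed corpora, e.g. as the proportion of object-before-verb orders. The (verb index, object index) pairs below are invented rather than read from a real Universal Dependencies treebank.

```python
# Sketch: a gradient (continuous) word-order feature instead of a
# categorical OV/VO label. Each pair holds the token positions of a verb
# and its object in one sentence; values are invented toy data.
verb_object_positions = [
    (3, 5), (2, 4), (6, 3), (1, 2), (4, 2), (2, 5),
]

ov_ratio = sum(obj < verb for verb, obj in verb_object_positions) \
           / len(verb_object_positions)
print(f"object-before-verb ratio: {ov_ratio:.2f}")  # 0.33 here

# A categorical database would round this to "VO", discarding the
# within-language variability that a continuous value preserves.
```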
- Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
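As a baseline illustration of the decompounding task itself (not the paper's trained models), a greedy dictionary-based splitter can be written in a few lines; the mini-lexicon below is invented.

```python
# Naive dictionary-based decompounding sketch (German-style compounds).
# The lexicon is a toy stand-in; CompoundPiece instead trains dedicated
# models on 255k Wiktionary-derived words.
LEXICON = {"haus", "tür", "schloss", "schlüssel", "bund"}

def split_compound(word, lexicon=LEXICON):
    """Return constituents if the word segments fully, else [word]."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 0, -1):    # prefer longest head
        head, rest = word[:i], word[i:]
        if head in lexicon:
            tail = split_compound(rest, lexicon)
            if all(part in lexicon for part in tail):
                return [head] + tail
    return [word]

print(split_compound("Haustür"))        # ['haus', 'tür']
print(split_compound("Schlüsselbund"))  # ['schlüssel', 'bund']
print(split_compound("Banane"))         # ['banane'] (non-compound)
```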
- Offline RL for Natural Language Generation with Implicit Language Q Learning [87.76695816348027]
Large language models can be inconsistent when it comes to completing user-specified tasks.
We propose a novel RL method, ILQL, that combines the flexible utility framework of RL with the ability of supervised learning to leverage previously collected data.
In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings.
arXiv Detail & Related papers (2022-06-05T18:38:42Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
Differences in the linguistic complexity of the datasets also allow us to discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Iterative Rule Extension for Logic Analysis of Data: an MILP-based heuristic to derive interpretable binary classification from large datasets [0.6526824510982799]
This work presents IRELAND, an algorithm for abstracting Boolean phrases in DNF from datasets with up to 10,000 samples and sample characteristics.
The results show that for large datasets IRELAND outperforms the current state-of-the-art and can find solutions for datasets where current models run out of memory or need excessive runtimes.
arXiv Detail & Related papers (2021-10-25T13:31:30Z)
- Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of the PCFG and the use of pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
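The program-composition side of such a generative model can be pictured as sampling from a probabilistic context-free grammar. The tiny grammar below is invented, and the step of mapping each sampled program to an utterance (done with BART in the paper) is left out.

```python
import random

# Toy PCFG over a miniature query language: each nonterminal maps to
# (probability, right-hand side) alternatives. The grammar is invented;
# the paper learns rule probabilities from real parsing data.
PCFG = {
    "QUERY": [(0.7, ["SELECT", "COL", "FROM", "TABLE"]),
              (0.3, ["SELECT", "COL", "FROM", "TABLE", "WHERE", "COND"])],
    "COL":   [(0.5, ["name"]), (0.5, ["population"])],
    "TABLE": [(1.0, ["city"])],
    "COND":  [(1.0, ["population", ">", "NUM"])],
    "NUM":   [(0.5, ["1000"]), (0.5, ["1000000"])],
}

def sample(symbol="QUERY"):
    """Recursively expand a nonterminal by sampling PCFG rules."""
    if symbol not in PCFG:  # terminal symbol
        return [symbol]
    weights, rhss = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=weights)[0]
    return [tok for sym in rhs for tok in sample(sym)]

random.seed(0)
for _ in range(3):
    print(" ".join(sample()))

# A seq-to-seq model (BART in the paper) would then map each sampled
# program to a natural-language utterance to create synthetic pairs.
```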