Related papers: Beyond cognacy

URL: http://arxiv.org/abs/2507.03005v1
Date: Wed, 02 Jul 2025 06:47:34 GMT
Title: Beyond cognacy
Authors: Gerhard Jäger
Abstract summary: Two fully automated methods are compared to extract phylogenetic signal directly from lexical data. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal.
Score: 0.21756081703275998
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

Related papers

Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation [9.23725598061561]
This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages. We show OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods.
arXiv Detail & Related papers (2026-02-04T05:59:25Z)
The Cognate Data Bottleneck in Language Phylogenetics [49.1574468325115]
Phylogenetic data analysis approaches that require larger datasets cannot be applied to cognate data. It remains an open question whether, and how, these computational approaches can be applied in historical linguistics.
arXiv Detail & Related papers (2025-07-01T16:14:20Z)
Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels. By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer [4.609569810881602]
Identification of cognates across related languages is one of the primary problems in historical linguistics. We present a transformer-based architecture inspired by computational biology for the task of automated cognate detection.
arXiv Detail & Related papers (2024-02-05T11:47:36Z)
Are Sounds Sound for Phylogenetic Reconstruction? [41.85920785319125]
We test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer to the gold standard phylogenies than those reconstructed from sound correspondences, by approximately one third on average with respect to the generalized quartet distance.
arXiv Detail & Related papers (2024-02-05T08:35:33Z)
Gene Set Summarization using Large Language Models [1.312659265502151]
We develop a method that uses GPT models to perform gene set function summarization. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant.
arXiv Detail & Related papers (2023-05-21T02:06:33Z)
Can the Language of the Collation be Translated into the Language of the Stemma? Using Machine Translation for Witness Localization [0.0]
Computational methods are partly shared between the sister discipline of phylogenetics and stemmatology. Deep Learning (DL) has had only minor successes in phylogenetics. In stemmatology, there is to date no known DL approach at all.
arXiv Detail & Related papers (2022-06-11T20:10:21Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Exploiting Language Model for Efficient Linguistic Steganalysis: An Empirical Study [23.311007481830647]
We present two methods for efficient linguistic steganalysis. One is to pre-train a language model based on RNN, and the other is to pre-train a sequence autoencoder.
arXiv Detail & Related papers (2021-07-26T12:37:18Z)
Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers. We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures. We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.