Principal Component Analysis as a Sanity Check for Bayesian
Phylolinguistic Reconstruction
- URL: http://arxiv.org/abs/2402.18877v1
- Date: Thu, 29 Feb 2024 05:47:34 GMT
- Title: Principal Component Analysis as a Sanity Check for Bayesian
Phylolinguistic Reconstruction
- Authors: Yugo Murawaki
- Abstract summary: The tree model assumes that languages descended from a common ancestor and underwent modifications over time.
This assumption can be violated to different extents due to contact and other factors.
We propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis.
- Score: 3.652806821280741
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Bayesian approaches to reconstructing the evolutionary history of languages
rely on the tree model, which assumes that these languages descended from a
common ancestor and underwent modifications over time. However, this assumption
can be violated to different extents due to contact and other factors.
Understanding the degree to which this assumption is violated is crucial for
validating the accuracy of phylolinguistic inference. In this paper, we propose
a simple sanity check: projecting a reconstructed tree onto a space generated
by principal component analysis. By using both synthetic and real data, we
demonstrate that our method effectively visualizes anomalies, particularly in
the form of jogging.
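
The sanity check described in the abstract can be illustrated with a minimal sketch. It assumes that each language is a vector of binary features, that states at internal nodes have already been reconstructed (given here as made-up expected values between 0 and 1), and that the principal components are fitted on the leaf languages only. None of these details is confirmed by the abstract; the toy data and the helper names `fit_pca` and `project` are hypothetical.

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit PCA on the rows of X via SVD; return the mean and the top axes."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Project the rows of X onto the fitted principal axes."""
    return (X - mean) @ components.T


# Hypothetical toy data: 4 leaf languages described by 6 binary features.
leaves = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
], dtype=float)

# Hypothetical reconstructed states for internal nodes (assumed inputs).
internal = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.5, 0.0],   # node 4: ancestor of leaves 0 and 1
    [0.0, 0.0, 1.0, 1.0, 0.5, 0.5],   # node 5: ancestor of leaves 2 and 3
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.25],  # node 6: root
])

# Tree edges as (parent, child) indices into the stacked node array.
edges = [(6, 4), (6, 5), (4, 0), (4, 1), (5, 2), (5, 3)]

# Fit the space on observed languages, then place every node in it.
mean, comps = fit_pca(leaves, n_components=2)
coords = project(np.vstack([leaves, internal]), mean, comps)

# Each edge becomes a line segment in the 2-D PCA plane.
segments = [(coords[p], coords[c]) for p, c in edges]
```

Drawing each pair in `segments` as a line renders the tree in the PCA plane, so edges that look out of place relative to the rest of the tree can be inspected by eye.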
Related papers
- PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders [5.505257238864315]
PhyloVAE is an unsupervised learning framework designed for representation learning and generative modeling of tree topologies.
We develop a deep latent-variable generative model that facilitates fast, parallelized topology generation.
Experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
arXiv Detail & Related papers (2025-02-07T07:58:47Z)
- PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation [50.80441546742053]
Phylogenetic trees elucidate evolutionary relationships among species.
Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens.
We propose PhyloGen, a novel method leveraging a pre-trained genomic language model.
arXiv Detail & Related papers (2024-12-25T08:33:05Z)
- Gumbel Counterfactual Generation From Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
We show that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
arXiv Detail & Related papers (2024-11-11T17:57:30Z)
- Improved Neural Protoform Reconstruction via Reflex Prediction [11.105362395278142]
We argue that not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms.
We propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model.
arXiv Detail & Related papers (2024-03-27T17:13:38Z)
- Are Sounds Sound for Phylogenetic Reconstruction? [41.85920785319125]
We test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction.
Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average.
arXiv Detail & Related papers (2024-02-05T08:35:33Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
- Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances".
Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
- Spectral neighbor joining for reconstruction of latent tree models [5.229354894035374]
We develop Spectral Neighbor Joining, a novel method to recover the structure of latent tree graphical models.
We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix.
We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.
arXiv Detail & Related papers (2020-02-28T05:13:08Z)
- A Critical View of the Structural Causal Model [89.43277111586258]
We show that one can identify the cause and the effect without considering their interaction at all.
We propose a new adversarial training method that mimics the disentangled structure of the causal model.
Our multidimensional method outperforms the literature methods on both synthetic and real world datasets.
arXiv Detail & Related papers (2020-02-23T22:52:28Z)