Related papers: Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction

URL: http://arxiv.org/abs/2402.18877v1
Date: Thu, 29 Feb 2024 05:47:34 GMT
Title: Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction
Authors: Yugo Murawaki
Abstract summary: The tree model assumes that languages descended from a common ancestor and underwent modifications over time. This assumption can be violated to different extents due to contact and other factors. We propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis.
Score: 3.652806821280741
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Bayesian approaches to reconstructing the evolutionary history of languages rely on the tree model, which assumes that these languages descended from a common ancestor and underwent modifications over time. However, this assumption can be violated to different extents due to contact and other factors. Understanding the degree to which this assumption is violated is crucial for validating the accuracy of phylolinguistic inference. In this paper, we propose a simple sanity check: projecting a reconstructed tree onto a space generated by principal component analysis. By using both synthetic and real data, we demonstrate that our method effectively visualizes anomalies, particularly in the form of jogging.
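
A minimal sketch of the idea in the abstract, not the author's implementation: fit PCA on the feature vectors of the observed languages, then project every node of a reconstructed tree, internal nodes included, into the same low-dimensional space so that each edge can be drawn as a line segment. The toy languages, the tree, and the helper names `fit_pca` and `project` are all hypothetical, and the ancestral states are simple child averages standing in for states that a Bayesian reconstruction would supply.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Fit PCA on the rows of X; return (mean, components), components as rows."""
    mean = X.mean(axis=0)
    # Rows of Vt from the SVD of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(x, mean, components):
    """Project a feature vector (or matrix of rows) onto the principal axes."""
    return (x - mean) @ components.T

# Hypothetical toy data: 4 leaf languages, 6 binary features each.
leaves = {
    "A": [1, 1, 0, 0, 1, 0],
    "B": [1, 1, 0, 0, 0, 0],
    "C": [0, 0, 1, 1, 0, 1],
    "D": [0, 0, 1, 1, 1, 1],
}
# Hypothetical reconstructed tree, stored as child -> parent.
parent = {"A": "AB", "B": "AB", "C": "CD", "D": "CD", "AB": "root", "CD": "root"}

# Assumption: ancestral states are child averages, a stand-in for the
# states a Bayesian phylogenetic model would actually infer.
states = {name: np.array(vec, dtype=float) for name, vec in leaves.items()}
for node in ["AB", "CD", "root"]:  # children are processed before parents
    children = [c for c, p in parent.items() if p == node]
    states[node] = np.mean([states[c] for c in children], axis=0)

# Fit PCA on the observed (leaf) languages only, then project all nodes.
leaf_matrix = np.array([states[name] for name in leaves])
mean, components = fit_pca(leaf_matrix)
coords = {name: project(vec, mean, components) for name, vec in states.items()}

# Each tree edge becomes a segment (parent point, child point) in PCA space.
edges = [(coords[p], coords[c]) for c, p in parent.items()]
```

Plotting the segments in `edges` with any 2D plotting library gives the projected tree. Fitting PCA on the leaves alone is a design choice of this sketch: it keeps the axes tied to observed data, so reconstructed ancestors that land in unexpected places stand out.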

Related papers

Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search [0.0]
Our model integrates data-driven inference with rule-based inference to infer protoforms from cognate sets. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages.
arXiv Detail & Related papers (2025-06-12T11:58:06Z)
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders [5.505257238864315]
PhyloVAE is an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. We develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. Experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
arXiv Detail & Related papers (2025-02-07T07:58:47Z)
PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation [50.80441546742053]
Phylogenetic trees elucidate evolutionary relationships among species. Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model.
arXiv Detail & Related papers (2024-12-25T08:33:05Z)
Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions. We propose a framework for generating true string counterfactuals. Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z)
Improved Neural Protoform Reconstruction via Reflex Prediction [11.105362395278142]
We argue that not only should protoforms be inferable from cognate sets (sets of related reflexes) but the reflexes should also be inferable from the protoforms. We propose a system in which candidate protoforms from a reconstruction model are reranked by a reflex prediction model.
arXiv Detail & Related papers (2024-03-27T17:13:38Z)
Are Sounds Sound for Phylogenetic Reconstruction? [41.85920785319125]
We test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average.
arXiv Detail & Related papers (2024-02-05T08:35:33Z)
Sharded Bayesian Additive Regression Trees [1.4213973379473654]
We introduce a randomization auxiliary variable and a sharding tree to decide partitioning of data. By observing that the optimal design of a sharding tree can determine optimal sharding for sub-models on a product space, we introduce an intersection tree structure to completely specify both the sharding and modeling using only tree structures.
arXiv Detail & Related papers (2023-06-01T05:41:31Z)
Posterior Collapse of a Linear Latent Variable Model [6.2255027793924285]
This work identifies the existence and cause of a type of posterior collapse that frequently occurs in the Bayesian deep learning practice. For a general linear latent variable model, we precisely identify the nature of posterior collapse to be the competition between the likelihood and the regularization of the mean due to the prior.
arXiv Detail & Related papers (2022-05-09T02:30:52Z)
A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes. We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns. This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
Towards a Theoretical Understanding of the Robustness of Variational Autoencoders [82.68133908421792]
We make inroads into understanding the robustness of Variational Autoencoders (VAEs) to adversarial attacks and other input perturbations. We develop a novel criterion for robustness in probabilistic models: $r$-robustness. We show that VAEs trained using disentangling methods score well under our robustness metrics.
arXiv Detail & Related papers (2020-07-14T21:22:29Z)
Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances". Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
Spectral neighbor joining for reconstruction of latent tree models [5.229354894035374]
We develop Spectral Neighbor Joining, a novel method to recover the structure of latent tree graphical models. We prove that SNJ is consistent, and derive a sufficient condition for correct tree recovery from an estimated similarity matrix. We illustrate via extensive simulations that in comparison to several other reconstruction methods, SNJ requires fewer samples to accurately recover trees with a large number of leaves or long edges.
arXiv Detail & Related papers (2020-02-28T05:13:08Z)
A Critical View of the Structural Causal Model [89.43277111586258]
We show that one can identify the cause and the effect without considering their interaction at all. We propose a new adversarial training method that mimics the disentangled structure of the causal model. Our multidimensional method outperforms the literature methods on both synthetic and real world datasets.
arXiv Detail & Related papers (2020-02-23T22:52:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.