Phylo2Vec: a vector representation for binary trees
- URL: http://arxiv.org/abs/2304.12693v4
- Date: Mon, 04 Nov 2024 15:37:52 GMT
- Title: Phylo2Vec: a vector representation for binary trees
- Authors: Matthew J Penn, Neil Scheidwasser, Mark P Khurana, David A DuchĂȘne, Christl A Donnelly, Samir Bhatt,
- Abstract summary: We present Phylo2Vec, a parsimonious encoding for phylogenetic trees.
It serves as a unified approach for both manipulating and representing phylogenetic trees.
As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets.
- Score: 0.49478969093606673
- License:
- Abstract: Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with $n$ leaves to a unique integer vector of length $n-1$. The advantages of Phylo2Vec are fourfold: i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, iii) quick and unambiguous verification if two binary trees are identical topologically, and iv) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
Related papers
- Learning a Decision Tree Algorithm with Transformers [75.96920867382859]
We introduce MetaTree, a transformer-based model trained via meta-learning to directly produce strong decision trees.
We fit both greedy decision trees and globally optimized decision trees on a large number of datasets, and train MetaTree to produce only the trees that achieve strong generalization performance.
arXiv Detail & Related papers (2024-02-06T07:40:53Z) - MAPTree: Beating "Optimal" Decision Trees with Bayesian Decision Trees [2.421336072915701]
We present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees.
We propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree.
arXiv Detail & Related papers (2023-09-26T23:43:37Z) - Structure-Unified M-Tree Coding Solver for MathWord Problem [57.825176412485504]
In previous work, models designed by taking into account the properties of the binary tree structure of mathematical expressions at the output side have achieved better performance.
In this paper, we propose the Structure-Unified M-Tree Coding Coding (S-UMCr), which applies a tree with any M branches (M-tree) to unify the output structures.
Experimental results on the widely used MAWPS and Math23K datasets have demonstrated that SUMC-r not only outperforms several state-of-the-art models but also performs much better under low-resource conditions.
arXiv Detail & Related papers (2022-10-22T12:20:36Z) - Learning Tree Structures from Leaves For Particle Decay Reconstruction [0.0]
We present a neural approach to reconstructing rooted tree graphs describing hierarchical interactions, using a novel representation we term the Lowest Common Ancestor Generations (LCAG) matrix.
We are able to correctly predict the LCAG purely from leaf features for a maximum tree-depth of $8$ in $92.5%$ of cases for trees up to $6$ leaves (including) and $59.7%$ for trees up to $10$ in our simulated dataset.
arXiv Detail & Related papers (2022-08-31T15:36:47Z) - Spectral Top-Down Recovery of Latent Tree Models [13.681975313065477]
Spectral Top-Down Recovery (STDR) is a divide-and-conquer approach for inference of large latent tree models.
STDR's partitioning step is non-random. Instead, it is based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes.
We prove that STDR is statistically consistent, and bound the number of samples required to accurately recover the tree with high probability.
arXiv Detail & Related papers (2021-02-26T02:47:42Z) - Visualizing hierarchies in scRNA-seq data using a density tree-biased
autoencoder [50.591267188664666]
We propose an approach for identifying a meaningful tree structure from high-dimensional scRNA-seq data.
We then introduce DTAE, a tree-biased autoencoder that emphasizes the tree structure of the data in low dimensional space.
arXiv Detail & Related papers (2021-02-11T08:48:48Z) - SGA: A Robust Algorithm for Partial Recovery of Tree-Structured
Graphical Models with Noisy Samples [75.32013242448151]
We consider learning Ising tree models when the observations from the nodes are corrupted by independent but non-identically distributed noise.
Katiyar et al. (2020) showed that although the exact tree structure cannot be recovered, one can recover a partial tree structure.
We propose Symmetrized Geometric Averaging (SGA), a more statistically robust algorithm for partial tree recovery.
arXiv Detail & Related papers (2021-01-22T01:57:35Z) - Growing Deep Forests Efficiently with Soft Routing and Learned
Connectivity [79.83903179393164]
This paper further extends the deep forest idea in several important aspects.
We employ a probabilistic tree whose nodes make probabilistic routing decisions, a.k.a., soft routing, rather than hard binary decisions.
Experiments on the MNIST dataset demonstrate that our empowered deep forests can achieve better or comparable performance than [1],[3].
arXiv Detail & Related papers (2020-12-29T18:05:05Z) - Recursive Top-Down Production for Sentence Generation with Latent Trees [77.56794870399288]
We model the production property of context-free grammars for natural and synthetic languages.
We present a dynamic programming algorithm that marginalises over latent binary tree structures with $N$ leaves.
We also present experimental results on German-English translation on the Multi30k dataset.
arXiv Detail & Related papers (2020-10-09T17:47:16Z) - Born-Again Tree Ensembles [9.307453801175177]
Tree ensembles offer a good prediction quality in various domains, but the concurrent use of multiple trees reduces the interpretability of the ensemble.
We study the process of constructing a single decision tree of minimum size that reproduces the exact same behavior as a given tree ensemble in its entire feature space.
This algorithm generates optimal born-again trees for many datasets of practical interest.
arXiv Detail & Related papers (2020-03-24T22:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.