A Scalable Trie Building Algorithm for High-Throughput Phyloanalysis of Wafer-Scale Digital Evolution Experiments
- URL: http://arxiv.org/abs/2508.15074v1
- Date: Wed, 20 Aug 2025 21:18:51 GMT
- Title: A Scalable Trie Building Algorithm for High-Throughput Phyloanalysis of Wafer-Scale Digital Evolution Experiments
- Authors: Vivaan Singhvi, Joey Wagner, Emily Dolson, Luis Zaman, Matthew Andres Moreno
- Abstract summary: High-resolution snapshots of lineage ancestries from digital experiments are key to investigations of evolvability and open-ended evolution. Advances in AI/ML hardware accelerator devices, such as the 850,000-processor Cerebras Wafer-Scale Engine (WSE), are poised to broaden the scope of evolutionary questions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agent-based simulation platforms play a key role in enabling fast-to-run evolution experiments that can be precisely controlled and observed in detail. Availability of high-resolution snapshots of lineage ancestries from digital experiments, in particular, is key to investigations of evolvability and open-ended evolution, as well as in providing a validation testbed for bioinformatics method development. Ongoing advances in AI/ML hardware accelerator devices, such as the 850,000-processor Cerebras Wafer-Scale Engine (WSE), are poised to broaden the scope of evolutionary questions that can be investigated in silico. However, constraints in memory capacity and locality characteristic of these systems introduce difficulties in exhaustively tracking phylogenies at runtime. To overcome these challenges, recent work on hereditary stratigraphy algorithms has developed space-efficient genetic markers to facilitate fully decentralized estimation of relatedness among digital organisms. However, in existing work, compute time to reconstruct phylogenies from these genetic markers has proven a limiting factor in achieving large-scale phyloanalyses. Here, we detail an improved trie-building algorithm designed to produce reconstructions equivalent to existing approaches. For modestly-sized 10,000-tip trees, the proposed approach achieves a 300-fold speedup versus existing state-of-the-art. Finally, using 1 billion genome datasets drawn from WSE simulations encompassing 954 trillion replication events, we report a pair of large-scale phylogeny reconstruction trials, achieving end-to-end reconstruction times of 2.6 and 2.9 hours. In substantially improving reconstruction scaling and throughput, presented work establishes a key foundation to enable powerful high-throughput phyloanalysis techniques in large-scale digital evolution experiments.
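As a rough illustration only, and not the algorithm detailed in the paper, the general idea of reconstructing relatedness by inserting marker sequences into a trie can be sketched as follows. The sketch assumes each genome carries an ordered tuple of marker values (oldest first) and that all genomes hold markers for the same generations; the names `TrieNode`, `build_trie`, and `shared_depth`, as well as the sample data, are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One trie node; children are keyed by marker value."""
    children: dict = field(default_factory=dict)
    taxa: list = field(default_factory=list)  # labels of genomes ending here


def build_trie(genomes):
    """Insert each genome's marker sequence into a shared-prefix trie.

    `genomes` maps a taxon label to a tuple of marker values ordered from
    oldest to newest (an assumed layout for this illustration).
    """
    root = TrieNode()
    for label, markers in genomes.items():
        node = root
        for marker in markers:
            # Reuse the child for this marker if present, else create it.
            node = node.children.setdefault(marker, TrieNode())
        node.taxa.append(label)
    return root


def shared_depth(root, a, b):
    """Count the leading markers two sequences share by walking the trie."""
    depth, node = 0, root
    for x, y in zip(a, b):
        if x != y or x not in node.children:
            break
        node = node.children[x]
        depth += 1
    return depth


# Hypothetical marker sequences for three digital organisms.
genomes = {
    "org_a": (7, 3, 9, 1),
    "org_b": (7, 3, 2, 8),
    "org_c": (5, 6, 4, 0),
}
trie = build_trie(genomes)
```

In this toy setup, genomes that share a longer marker prefix branch apart deeper in the trie, which is read as more recent common ancestry. Real hereditary stratigraphy markers are retained at differing generations across genomes, which this sketch does not handle.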
Related papers
- Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences [4.970277730082774]
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. Existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences.
arXiv Detail & Related papers (2025-12-10T23:03:10Z)
- Distilled Protein Backbone Generation [59.63474232035653]
Diffusion- and flow-based generative models offer unprecedented capabilities for de novo protein design. These models are limited by their generating speed, often requiring hundreds of iterative steps in the reverse-diffusion process. We show how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators.
arXiv Detail & Related papers (2025-10-03T15:25:08Z)
- AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model [92.51919604882984]
We introduce AMix-1, a powerful protein foundation model built on Flow Bayesian Networks. AMix-1 is empowered by a systematic training methodology, encompassing pretraining scaling laws, emergent capability analysis, in-context learning mechanism, and test-time scaling algorithm. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework.
arXiv Detail & Related papers (2025-07-11T17:02:25Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- A Guide to Tracking Phylogenies in Parallel and Distributed Agent-based Evolution Models [0.0]
In silico work with agent-based models provides an opportunity to collect high-quality records of ancestry relationships among simulated agents.
Existing work generally tracks lineages directly, yielding an exact phylogenetic record of evolutionary history.
Post hoc estimation is akin to how bioinformaticians build phylogenies by assessing genetic similarities between organisms.
arXiv Detail & Related papers (2024-05-16T15:27:51Z)
- Trackable Island-model Genetic Algorithms at Wafer Scale [0.0]
We present a tracking-enabled asynchronous island-based genetic algorithm (GA) framework for Cerebras Wafer-Scale Engine (WSE) hardware.
We validate phylogenetic reconstructions and demonstrate their suitability for inference of underlying evolutionary conditions.
These benchmark and validation trials reflect strong potential for highly scalable evolutionary computation.
arXiv Detail & Related papers (2024-05-06T16:17:33Z)
- Trackable Agent-based Evolution Models at Wafer Scale [0.0]
We focus on the problem of extracting phylogenetic information from agent-based evolution on the 850,000-processor Cerebras Wafer-Scale Engine (WSE).
We present an asynchronous island-based genetic algorithm (GA) framework for WSE hardware.
We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions.
arXiv Detail & Related papers (2024-04-16T19:24:14Z)
- Phylogeny-informed fitness estimation [58.720142291102135]
We propose phylogeny-informed fitness estimation, which exploits a population's phylogeny to estimate fitness evaluations.
Our results indicate that phylogeny-informed fitness estimation can mitigate the drawbacks of down-sampled lexicase.
This work serves as an initial step toward improving evolutionary algorithms by exploiting runtime phylogenetic analysis.
arXiv Detail & Related papers (2023-06-06T19:05:01Z)
- Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs [6.708052194104378]
We extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets.
The key idea is to use an augmented kernel which preserves the factorisability of the lower bound allowing for fast variational inference.
arXiv Detail & Related papers (2022-09-14T15:25:15Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a deep neural network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.