Related papers: Can the Language of the Collation be Translated into the Language of the Stemma? Using Machine Translation for Witness Localization

URL: http://arxiv.org/abs/2206.05603v1
Date: Sat, 11 Jun 2022 20:10:21 GMT
Title: Can the Language of the Collation be Translated into the Language of the Stemma? Using Machine Translation for Witness Localization
Authors: Armin Hoenen
Abstract summary: Computational methods are partly shared between stemmatology and its sister discipline, phylogenetics. Deep Learning (DL) has had only minor successes in phylogenetics. In stemmatology, there is to date no known DL approach at all.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Stemmatology is a subfield of philology where one approach to understanding the copy history of textual variants of a text (witnesses of a tradition) is to generate an evolutionary tree. Computational methods are partly shared between stemmatology and its sister discipline, phylogenetics. In 2022, a survey paper in Nature Communications found that Deep Learning (DL), which has otherwise brought about major improvements in many fields (Krohn et al. 2020), has had only minor successes in phylogenetics and that "it is difficult to conceive of an end-to-end DL model to directly estimate phylogenetic trees from raw data in the near future" (Sapoval et al. 2022, p. 8). In stemmatology, there is to date no known DL approach at all. In this paper, we present a new DL approach to the placement of manuscripts on a stemma and demonstrate its potential. The approach could be extended to phylogenetics, where the universal code of DNA might be an even better prerequisite for a method that uses sequence-to-sequence neural networks to retrieve tree distances.

Related papers

Beyond cognacy [0.21756081703275998]
Two fully automated methods are compared to extract phylogenetic signal directly from lexical data. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal.
arXiv Detail & Related papers (2025-07-02T06:47:34Z)
From Sentences to Sequences: Rethinking Languages in Biological System [6.304152224988003]
We revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence, we highlight the importance of structural evaluation.
arXiv Detail & Related papers (2025-07-01T16:57:39Z)
The Cognate Data Bottleneck in Language Phylogenetics [49.1574468325115]
Phylogenetic data analysis approaches that require larger datasets cannot be applied to cognate data. It remains an open question how, and if, these computational approaches can be applied in historical linguistics.
arXiv Detail & Related papers (2025-07-01T16:14:20Z)
Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [55.98854157265578]
Life-Code is a comprehensive framework that spans different biological functions. We propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
arXiv Detail & Related papers (2025-02-11T06:53:59Z)
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders [5.505257238864315]
PhyloVAE is an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. We develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. Experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
arXiv Detail & Related papers (2025-02-07T07:58:47Z)
PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation [50.80441546742053]
Phylogenetic trees elucidate evolutionary relationships among species. Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model.
arXiv Detail & Related papers (2024-12-25T08:33:05Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fine-tuning for Genomes. Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues. Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
Patterns of Persistence and Diffusibility across the World's Languages [3.7055269158186874]
Colexification is a type of similarity where a single lexical form is used to convey multiple meanings. We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology. We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
arXiv Detail & Related papers (2024-01-03T12:05:38Z)
Can Large Language Models Augment a Biomedical Ontology with missing Concepts and Relations? [1.1060425537315088]
We propose a method that uses semantic interactions with an LLM to analyze clinical practice guidelines. Our initial experimentation with the prompts yielded promising results given a manually generated gold standard.
arXiv Detail & Related papers (2023-11-12T14:20:55Z)
PhyloGFN: Phylogenetic inference with generative flow networks [57.104166650526416]
We introduce the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets.
arXiv Detail & Related papers (2023-10-12T23:46:08Z)
Lattice-preserving $\mathcal{ALC}$ ontology embeddings with saturation [50.05281461410368]
An order-preserving embedding method is proposed to generate embeddings of OWL representations. We show that our method outperforms state-of-the-art embedding methods in several knowledge base completion tasks.
arXiv Detail & Related papers (2023-05-11T22:27:51Z)
Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy. We present a new method that allows achieving high results on this task with little effort. We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns. This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.