TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled
Zero-shot Genome Classification
- URL: http://arxiv.org/abs/2401.13219v1
- Date: Wed, 24 Jan 2024 04:16:28 GMT
- Title: TEPI: Taxonomy-aware Embedding and Pseudo-Imaging for Scarcely-labeled
Zero-shot Genome Classification
- Authors: Sathyanarayanan Aakur, Vishalini R. Laguduva, Priyadharsini
Ramamurthy, Akhilesh Ramachandran
- Abstract summary: A species' genetic code or genome encodes valuable evolutionary, biological, and phylogenetic information.
Traditional bioinformatics tools have made notable progress but lack scalability and are computationally expensive.
We propose addressing this problem through zero-shot learning using TEPI, Taxonomy-aware Embedding and Pseudo-Imaging.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A species' genetic code or genome encodes valuable evolutionary, biological,
and phylogenetic information that aids in species recognition, taxonomic
classification, and understanding genetic predispositions like drug resistance
and virulence. However, the vast number of potential species poses significant
challenges in developing a general-purpose whole genome classification tool.
Traditional bioinformatics tools have made notable progress but lack
scalability and are computationally expensive. Machine learning-based
frameworks show promise but must address the issue of large classification
vocabularies with long-tail distributions. In this study, we propose addressing
this problem through zero-shot learning using TEPI, Taxonomy-aware Embedding
and Pseudo-Imaging. We represent each genome as pseudo-images and map them to a
taxonomy-aware embedding space for reasoning and classification. This embedding
space captures compositional and phylogenetic relationships of species,
enabling predictions in extensive search spaces. We evaluate TEPI using two
rigorous zero-shot settings and demonstrate its generalization capabilities
qualitatively on curated, large-scale, publicly sourced data.
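The pipeline described in the abstract (genome to pseudo-image, pseudo-image to embedding, embedding to nearest label) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the pseudo-image is a normalized co-occurrence matrix of adjacent k-mers, the "embedding" is simply the flattened matrix rather than a learned, taxonomy-aware network, and the function names (`pseudo_image`, `nearest_label`) and toy sequences are hypothetical.

```python
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every k-mer over ACGT to a row/column index."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def pseudo_image(sequence, k=2):
    """Build a (4^k x 4^k) matrix of normalized counts of adjacent k-mer pairs.

    Assumption: this co-occurrence encoding is one plausible way to turn a
    sequence into an image-like array; the paper's exact encoding may differ.
    """
    index = kmer_index(k)
    size = len(index)
    image = [[0.0] * size for _ in range(size)]
    total = 0
    for i in range(len(sequence) - 2 * k + 1):
        left = sequence[i:i + k]
        right = sequence[i + k:i + 2 * k]
        # Skip windows containing ambiguous bases such as N.
        if left in index and right in index:
            image[index[left]][index[right]] += 1.0
            total += 1
    if total:
        image = [[v / total for v in row] for row in image]
    return image


def flatten(image):
    """Flatten the pseudo-image into a vector (stand-in for a learned encoder)."""
    return [v for row in image for v in row]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_label(query, label_embeddings):
    """Return the label whose embedding is most similar to the query."""
    return max(label_embeddings, key=lambda name: cosine(query, label_embeddings[name]))


# Hypothetical label embeddings built from toy sequences, not real genomes.
label_embeddings = {
    "at_rich": flatten(pseudo_image("ATATATATAT")),
    "gc_rich": flatten(pseudo_image("GCGCGCGCGC")),
}
query = flatten(pseudo_image("ATATATAT"))
print(nearest_label(query, label_embeddings))
```

In a zero-shot setting the label embeddings would come from side information about the taxonomy rather than from training genomes of the same species, so the nearest-neighbor lookup can return labels never seen during training. The toy label vectors here only stand in for that embedding space.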
Related papers
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- A Saliency-based Clustering Framework for Identifying Aberrant Predictions [49.1574468325115]
We introduce the concept of aberrant predictions, emphasizing that the nature of classification errors is as critical as their frequency.
We propose a novel, efficient training methodology aimed at both reducing the misclassification rate and discerning aberrant predictions.
We apply this methodology to veterinary radiology, a domain where the stakes are high but which has been studied less extensively than human medicine.
arXiv Detail & Related papers (2023-11-11T01:53:59Z)
- A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset [18.211840156134784]
This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment.
The dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community.
arXiv Detail & Related papers (2023-07-19T20:54:08Z)
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical ranges of 47k species.
arXiv Detail & Related papers (2023-06-05T03:36:01Z)
- Deep Visual-Genetic Biometrics for Taxonomic Classification of Rare Species [1.9819034119774483]
We propose aligned visual-genetic inference spaces with the aim to implicitly encode cross-domain associations for improved performance.
We experimentally demonstrate the efficacy of the concept via application to microscopic imagery of 30k+ planktic foraminifer shells.
Visual-genetic alignment can significantly benefit visual-only recognition of the rarest species.
arXiv Detail & Related papers (2023-05-11T10:04:27Z)
- Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z)
- Fine-Grained Zero-Shot Learning with DNA as Side Information [31.82132159867097]
We use DNA as side information for fine-grained zero-shot classification of species.
We implement a simple hierarchical Bayesian model that uses DNA information to establish the hierarchy in the image space.
We show that DNA can be an equally promising, and in general more accessible, alternative to word vectors.
arXiv Detail & Related papers (2021-09-29T01:45:22Z)
- Mycorrhiza: Genotype Assignment using Phylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
- Two-View Fine-grained Classification of Plant Species [66.75915278733197]
We propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species.
A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species.
arXiv Detail & Related papers (2020-05-18T21:57:47Z)
- Automatic image-based identification and biomass estimation of invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art ResNet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)