Deep metric learning improves lab of origin prediction of genetically
engineered plasmids
- URL: http://arxiv.org/abs/2111.12606v1
- Date: Wed, 24 Nov 2021 16:29:03 GMT
- Title: Deep metric learning improves lab of origin prediction of genetically
engineered plasmids
- Authors: Igor M. Soares, Fernando H. F. Camargo, Adriano Marques, Oliver M.
Crook
- Abstract summary: Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
- Score: 63.05016513788047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Genome engineering is undergoing unprecedented development and is now
becoming widely available. To ensure responsible biotechnology innovation and
to reduce misuse of engineered DNA sequences, it is vital to develop tools to
identify the lab-of-origin of engineered plasmids. Genetic engineering
attribution (GEA), the ability to make sequence-lab associations, would support
forensic experts in this process. Here, we propose a method, based on metric
learning, that ranks the most likely labs-of-origin whilst simultaneously
generating embeddings for plasmid sequences and labs. These embeddings can be
used to perform various downstream tasks, such as clustering DNA sequences and
labs, as well as using them as features in machine learning models. Our
approach employs circular shift augmentation and is able to
correctly rank the lab-of-origin $90\%$ of the time within its top 10
predictions, outperforming all current state-of-the-art approaches. We also
demonstrate that we can perform few-shot-learning and obtain $76\%$ top-10
accuracy using only $10\%$ of the sequences. This means we outperform the
previous CNN approach using only one-tenth of the data. We also demonstrate
that we are able to extract key signatures in plasmid sequences for particular
labs, allowing for an interpretable examination of the model's outputs.
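
The two ideas named in the abstract, circular shift augmentation and top-10 ranking over embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of Euclidean distance, and the toy embeddings are all assumptions made for illustration, and the abstract does not specify the actual model, loss, or distance.

```python
import math
import random


def circular_shift(seq: str, offset: int) -> str:
    """Rotate a circular plasmid sequence so it starts `offset` bases later."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def random_circular_shift(seq: str, rng: random.Random) -> str:
    """Augmentation step: choose a random start position on the circle."""
    if not seq:
        return seq
    return circular_shift(seq, rng.randrange(len(seq)))


def rank_labs(seq_embedding, lab_embeddings, k=10):
    """Return the k lab names nearest to the sequence embedding.

    Euclidean distance is an assumption; any metric could be substituted.
    """
    distances = {
        lab: math.dist(seq_embedding, emb)
        for lab, emb in lab_embeddings.items()
    }
    return sorted(distances, key=distances.get)[:k]


def top_k_accuracy(examples, lab_embeddings, k=10):
    """Fraction of (embedding, true_lab) pairs whose true lab is in the top k."""
    hits = sum(
        true_lab in rank_labs(emb, lab_embeddings, k)
        for emb, true_lab in examples
    )
    return hits / len(examples)


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_circular_shift("ACGTACGGTA", rng))

    # Hypothetical 2-D embeddings, for illustration only.
    labs = {"lab_a": [0.0, 0.0], "lab_b": [1.0, 0.0], "lab_c": [5.0, 5.0]}
    print(rank_labs([0.9, 0.1], labs, k=2))
```

Because a plasmid is circular, every rotation of its linearised sequence describes the same molecule, which is why a rotation is a label-preserving augmentation. The ranking function shows how a shared embedding space yields an ordered list of candidate labs rather than a single prediction.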
Related papers
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- Horizon-wise Learning Paradigm Promotes Gene Splicing Identification [6.225959701339916]
We propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI).
The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor.
In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation.
arXiv Detail & Related papers (2024-06-15T08:18:09Z)
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Learning to Untangle Genome Assembly with Graph Convolutional Networks [17.227634756670835]
We introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them.
Experimental results show that a model trained on simulated graphs generated solely from a single chromosome is remarkably able to resolve all other chromosomes.
arXiv Detail & Related papers (2022-06-01T04:14:25Z)
- Ranking labs-of-origin for genetically engineered DNA using Metric Learning [0.0]
We present a method that ranks the most likely labs-of-origin and generates embeddings for DNA sequences and labs.
These embeddings can also perform various other tasks, like clustering both DNA sequences and labs.
arXiv Detail & Related papers (2021-07-16T13:06:47Z)
- Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling.
We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
- Knowledge transfer across cell lines using Hybrid Gaussian Process models with entity embedding vectors [62.997667081978825]
A large number of experiments are performed to develop a biochemical process.
If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed.
arXiv Detail & Related papers (2020-11-27T17:38:15Z)
- A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference identifies the ancestry of each segment of an individual's genome.
We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture.
We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.