Learning to Untangle Genome Assembly with Graph Convolutional Networks
- URL: http://arxiv.org/abs/2206.00668v1
- Date: Wed, 1 Jun 2022 04:14:25 GMT
- Title: Learning to Untangle Genome Assembly with Graph Convolutional Networks
- Authors: Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Mile Šikić
- Abstract summary: We introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them.
Experimental results show that a model trained on simulated graphs generated solely from a single chromosome is able to resolve all other chromosomes remarkably well.
- Score: 17.227634756670835
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A quest to determine the complete sequence of human DNA from telomere to
telomere started three decades ago and was finally completed in 2021. This
accomplishment was a result of a tremendous effort of numerous experts who
engineered various tools and performed laborious manual inspection to achieve
the first gapless genome sequence. However, such a method can hardly be used as a
general approach to assemble different genomes, especially when the assembly
speed is critical given the large amount of data. In this work, we explore a
different approach to the central part of the genome assembly task that
consists of untangling a large assembly graph from which a genomic sequence
needs to be reconstructed. Our main motivation is to reduce human-engineered
heuristics and use deep learning to develop more generalizable reconstruction
techniques. Precisely, we introduce a new learning framework to train a graph
convolutional network to resolve assembly graphs by finding a correct path
through them. The training is supervised with a dataset generated from the
resolved CHM13 human sequence and tested on assembly graphs built using real
human PacBio HiFi reads. Experimental results show that a model trained on
simulated graphs generated solely from a single chromosome is able to
resolve all other chromosomes remarkably well. Moreover, the model outperforms
hand-crafted heuristics from a state-of-the-art de novo assembler on
the same graphs. Chromosomes reconstructed with graph networks are more
accurate at the nucleotide level and show fewer contigs, a higher reconstructed
genome fraction, and higher NG50/NGA50 assessment metrics.
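The abstract reports NG50/NGA50 without defining them. As a rough illustration only: NG50 is commonly defined as the length of the shortest contig among the longest contigs that together cover at least half of the reference (or estimated) genome length. NGA50 is usually computed the same way on aligned blocks, that is, after contigs are broken at misassemblies. The sketch below assumes that common definition; the function name `ng50` and the toy contig lengths are hypothetical and are not taken from the paper or its evaluation pipeline.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_length: int) -> Optional[int]:
    """Return NG50, or None if the contigs cover less than half the genome.

    Assumes the common definition: walk the contigs from longest to
    shortest and report the length at which the running total first
    reaches 50% of the genome length.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        # Compare 2 * covered with genome_length to stay in integers.
        if 2 * covered >= genome_length:
            return length
    return None


if __name__ == "__main__":
    # Made-up contig lengths, for illustration only.
    toy_contigs = [80, 70, 50, 40, 30, 20]
    print(ng50(toy_contigs, genome_length=300))
```

Unlike N50, which is normalized by the total assembly length, NG50 is normalized by the genome length, so it stays comparable between assemblies that reconstruct different fractions of the same genome.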
Related papers
- GraSSRep: Graph-Based Self-Supervised Learning for Repeat Detection in Metagenomic Assembly [24.55141372357102]
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment.
GraSSRep is a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories.
GraSSRep combines sequencing features with pre-defined and learned graph features to achieve state-of-the-art performance in repeat detection.
arXiv Detail & Related papers (2024-02-14T18:26:58Z)
- SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning [131.04781590452308]
We present SimTeG, a frustratingly Simple approach for Textual Graph learning.
We first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task.
We then generate node embeddings using the last hidden states of the fine-tuned LM.
arXiv Detail & Related papers (2023-08-03T07:00:04Z)
- Graph Generation with Diffusion Mixture [57.78958552860948]
Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures.
We propose a generative framework that models the topology of graphs by explicitly learning the final graph structures of the diffusion process.
arXiv Detail & Related papers (2023-02-07T17:07:46Z)
- Graph Neural Networks for Microbial Genome Recovery [64.91162205624848]
We propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning.
Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph.
arXiv Detail & Related papers (2022-04-26T12:49:51Z)
- Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- DNA-GCN: Graph convolutional networks for predicting DNA-protein binding [4.1600531290054]
We build a sequence k-mer graph and learn a DNA Graph Convolutional Network (DNA-GCN) for the whole dataset.
DNA-GCN starts from a one-hot representation for all nodes and then jointly learns the embeddings for both k-mers and sequences.
We evaluate our model on 50 datasets from ENCODE.
arXiv Detail & Related papers (2021-06-02T07:36:11Z)
- Heterogeneous Similarity Graph Neural Network on Electronic Health Records [74.66674469510251]
We propose Heterogeneous Similarity Graph Neural Network (HSGNN) to analyze EHRs with a novel heterogeneous GNN.
Our framework consists of two parts: one is a preprocessing method and the other is an end-to-end GNN.
The GNN takes all homogeneous graphs as input and fuses all of them into one graph to make a prediction.
arXiv Detail & Related papers (2021-01-17T23:14:29Z)
- Molecular graph generation with Graph Neural Networks [2.7393821783237184]
We introduce a sequential molecular graph generator based on a set of graph neural network modules, which we call MG2N2.
Our model is capable of generalizing molecular patterns seen during the training phase, without overfitting.
arXiv Detail & Related papers (2020-12-14T10:32:57Z)
- A step towards neural genome assembly [0.0]
We train the MPNN model with a max-aggregator to execute several algorithms for graph simplification.
We show that the algorithms were learned successfully and can be scaled to graphs of sizes up to 20 times larger than the ones used in training.
arXiv Detail & Related papers (2020-11-10T10:12:19Z)
- A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference identifies the ancestry of each segment of an individual's genome.
We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture.
We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.