Related papers: Learning to Untangle Genome Assembly with Graph Convolutional Networks

URL: http://arxiv.org/abs/2206.00668v1
Date: Wed, 1 Jun 2022 04:14:25 GMT
Title: Learning to Untangle Genome Assembly with Graph Convolutional Networks
Authors: Lovro Vrček, Xavier Bresson, Thomas Laurent, Martin Schmitz, Mile Šikić
Abstract summary: We introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them. Experimental results show that a model, trained on simulated graphs generated solely from a single chromosome, is able to remarkably resolve all other chromosomes.
Score: 17.227634756670835
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: A quest to determine the complete sequence of human DNA from telomere to telomere started three decades ago and was finally completed in 2021. This accomplishment was a result of a tremendous effort of numerous experts who engineered various tools and performed laborious manual inspection to achieve the first gapless genome sequence. However, such a method can hardly be used as a general approach to assemble different genomes, especially when the assembly speed is critical given the large amount of data. In this work, we explore a different approach to the central part of the genome assembly task that consists of untangling a large assembly graph from which a genomic sequence needs to be reconstructed. Our main motivation is to reduce human-engineered heuristics and use deep learning to develop more generalizable reconstruction techniques. Precisely, we introduce a new learning framework to train a graph convolutional network to resolve assembly graphs by finding a correct path through them. The training is supervised with a dataset generated from the resolved CHM13 human sequence and tested on assembly graphs built using real human PacBio HiFi reads. Experimental results show that a model, trained on simulated graphs generated solely from a single chromosome, is able to remarkably resolve all other chromosomes. Moreover, the model outperforms hand-crafted heuristics from a state-of-the-art de novo assembler on the same graphs. Chromosomes reconstructed with graph networks are more accurate at the nucleotide level and report a lower number of contigs, a higher reconstructed genome fraction, and higher NG50/NGA50 assessment metrics.
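
The abstract reports NG50/NGA50 without defining them. As a rough illustration only (not code from the paper), NG50 is commonly defined as the length of the contig at which the running total of contig lengths, taken longest first, first reaches half of the reference genome length. NGA50 is usually described as the same statistic computed on aligned blocks after contigs are broken at misassemblies. The function name and the toy numbers below are hypothetical.

```python
def ng50(contig_lengths, genome_length):
    """Return NG50 for a set of contigs against a reference genome length.

    Contigs are visited longest first; the answer is the length of the
    contig at which the running total first reaches half of genome_length.
    Returns 0 if the contigs together cover less than half the genome.
    """
    half = genome_length / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Hypothetical toy assembly: five contigs against a 100-base reference.
toy_contigs = [40, 25, 15, 10, 5]
print(ng50(toy_contigs, 100))
```

Fewer, longer contigs raise NG50, which is why it is reported alongside the contig count and the reconstructed genome fraction.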

Related papers

Learning Genomic Structure from $k$-mers [2.07180164747172]
We present a method for analyzing read data using contrastive learning. An encoder model is trained to produce embeddings that cluster together sequences from the same genomic region. The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly.
arXiv Detail & Related papers (2025-05-22T13:46:18Z)
Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees [50.78679002846741]
We introduce a novel approach for learning cross-task generalities in graphs. We propose task-trees as basic learning instances to align task spaces on graphs. Our findings indicate that when a graph neural network is pretrained on diverse task-trees, it acquires transferable knowledge.
arXiv Detail & Related papers (2024-12-21T02:07:43Z)
GraSSRep: Graph-Based Self-Supervised Learning for Repeat Detection in Metagenomic Assembly [24.55141372357102]
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. GraSSRep is a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. GraSSRep combines sequencing features with pre-defined and learned graph features to achieve state-of-the-art performance in repeat detection.
arXiv Detail & Related papers (2024-02-14T18:26:58Z)
SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning [131.04781590452308]
We present SimTeG, a frustratingly Simple approach for Textual Graph learning. We first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task. We then generate node embeddings using the last hidden states of the fine-tuned LM.
arXiv Detail & Related papers (2023-08-03T07:00:04Z)
Graph Generation with Diffusion Mixture [57.78958552860948]
Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. We propose a generative framework that models the topology of graphs by explicitly learning the final graph structures of the diffusion process.
arXiv Detail & Related papers (2023-02-07T17:07:46Z)
Graph Neural Networks for Microbial Genome Recovery [64.91162205624848]
We propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning. Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph.
arXiv Detail & Related papers (2022-04-26T12:49:51Z)
Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations. We propose a method, based on metric learning, that ranks the most likely labs-of-origin. We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
DNA-GCN: Graph convolutional networks for predicting DNA-protein binding [4.1600531290054]
We build a sequence k-mer graph and learn a DNA Graph Convolutional Network (DNA-GCN) for the whole dataset. DNA-GCN is initialized with a one-hot representation for all nodes, and it then jointly learns the embeddings for both k-mers and sequences. We evaluate our model on 50 datasets from ENCODE.
arXiv Detail & Related papers (2021-06-02T07:36:11Z)
Heterogeneous Similarity Graph Neural Network on Electronic Health Records [74.66674469510251]
We propose Heterogeneous Similarity Graph Neural Network (HSGNN) to analyze EHRs with a novel heterogeneous GNN. Our framework consists of two parts: one is a preprocessing method and the other is an end-to-end GNN. The GNN takes all homogeneous graphs as input and fuses all of them into one graph to make a prediction.
arXiv Detail & Related papers (2021-01-17T23:14:29Z)
Molecular graph generation with Graph Neural Networks [2.7393821783237184]
We introduce a sequential molecular graph generator based on a set of graph neural network modules, which we call MG2N2. Our model is capable of generalizing molecular patterns seen during the training phase, without overfitting.
arXiv Detail & Related papers (2020-12-14T10:32:57Z)
A step towards neural genome assembly [0.0]
We train an MPNN model with a max-aggregator to execute several algorithms for graph simplification. We show that the algorithms were learned successfully and can be scaled to graphs of sizes up to 20 times larger than the ones used in training.
arXiv Detail & Related papers (2020-11-10T10:12:19Z)
A deep learning classifier for local ancestry inference [63.8376359764052]
Local ancestry inference identifies the ancestry of each segment of an individual's genome. We develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture. We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
arXiv Detail & Related papers (2020-11-04T00:42:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.