Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network
- URL: http://arxiv.org/abs/2408.16169v1
- Date: Wed, 28 Aug 2024 23:20:17 GMT
- Title: Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network
- Authors: Duncan Taylor, Melissa Humphries
- Abstract summary: We develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task.
With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DNA profiles are made up from multiple series of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts 'read' DNA profiles using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network, ANN, to carry out the task of classifying fluorescence types into categories in DNA profile electrophoretic signal. But the creation of the necessarily large amount of labelled training data for the ANN is time consuming and expensive, and a limiting factor in the ability to robustly train the ANN. If realistic, prelabelled, training data could be simulated then this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task. With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information, and then use the generator from the GAN as a 'realism filter' that applies the noise and artefact elements exhibited in typical electrophoretic signal.
Related papers
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z) - deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile [0.0]
We develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated.
We show that by simulating 100 000 profiles and training a number-of-contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC), a high level of performance is achieved: 89% for 1 to 10 contributors.
arXiv Detail & Related papers (2024-12-13T02:42:56Z) - VADA: a Data-Driven Simulator for Nanopore Sequencing [3.909855210960908]
We propose a purely data-driven method for simulating nanopores based on an autoregressive latent variable model.
We empirically demonstrate that our model achieves competitive simulation performance on experimental nanopore data.
We show we have learned an informative latent representation that is predictive of the DNA labels.
arXiv Detail & Related papers (2024-04-12T13:24:28Z) - Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection [105.9932053078449]
In this work, we show that, on the contrary, the small and training-free filter is sufficient to capture more general artifact representations.
Because it is unbiased towards both the training and test sources, we define it as the Data-Independent Operator (DIO), which achieves appealing improvements on unseen sources.
Our detector achieves a remarkable improvement of 13.3%, establishing a new state-of-the-art performance.
arXiv Detail & Related papers (2024-03-11T15:22:28Z) - Latent Diffusion Model for DNA Sequence Generation [5.194506374366898]
We propose a novel latent diffusion model, DiscDiff, tailored for discrete DNA sequence generation.
By simply embedding discrete DNA sequences into a continuous latent space using an autoencoder, we are able to leverage the powerful generative abilities of continuous diffusion models for the generation of discrete data.
We contribute a comprehensive cross-species dataset of 150K unique promoter-gene sequences from 15 species, enriching resources for future generative modelling in genomics.
arXiv Detail & Related papers (2023-10-09T20:58:52Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - Generative Adversarial Networks for Data Augmentation [0.0]
GANs have been utilized in medical image analysis for various tasks, including data augmentation, image creation, and domain adaptation.
GANs can generate synthetic samples that can be used to increase the available dataset.
It is essential to note that the use of GANs in medical imaging is still an active area of research to ensure that the produced images are of high quality and suitable for use in clinical settings.
arXiv Detail & Related papers (2023-06-03T06:33:33Z) - How far generated data can impact Neural Networks performance? [2.578242050187029]
We consider how far generated data can aid real data in improving the performance of Neural Networks.
In our experiments, we find that adding 5 times more synthetic data to the real FEs dataset increases accuracy by 16%.
arXiv Detail & Related papers (2023-03-27T14:02:43Z) - NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z) - Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z) - Convolutional Neural Networks for Sleep Stage Scoring on a Two-Channel EEG Signal [63.18666008322476]
Sleep problems are among the major diseases worldwide.
The basic tool used by specialists is the polysomnogram, which is a collection of different signals recorded during sleep.
Specialists have to score the different signals according to one of the standard guidelines.
arXiv Detail & Related papers (2021-03-30T09:59:56Z)