Exploring The Potential Of GANs In Biological Sequence Analysis
- URL: http://arxiv.org/abs/2303.02421v1
- Date: Sat, 4 Mar 2023 13:46:45 GMT
- Title: Exploring The Potential Of GANs In Biological Sequence Analysis
- Authors: Taslim Murad, Sarwan Ali, Murray Patterson
- Abstract summary: We propose a novel approach to handle the data imbalance issue based on Generative Adversarial Networks (GANs).
GANs are utilized to generate synthetic data that closely resembles the real data.
We perform 3 distinct classification tasks by using 3 different sequence datasets.
- Score: 0.966840768820136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biological sequence analysis is an essential step toward building a deeper
understanding of the underlying functions, structures, and behaviors of the
sequences. It can help in identifying the characteristics of the associated
organisms, like viruses, etc., and building prevention mechanisms to eradicate
their spread and impact, as viruses are known to cause epidemics that can
become pandemics globally. New tools for biological sequence analysis are
provided by machine learning (ML) technologies to effectively analyze the
functions and structures of the sequences. However, these ML-based methods
face challenges with data imbalance, which is common in biological
sequence datasets and hinders their performance. Although various strategies
exist to address this issue, such as the SMOTE algorithm, which creates
synthetic data, they focus on local information rather than the
overall class distribution. In this work, we explore a novel approach to handle
the data imbalance issue based on Generative Adversarial Networks (GANs) which
use the overall data distribution. GANs are utilized to generate synthetic data
that closely resembles the real data; this generated data can thus be employed
to enhance the ML models' performance by eradicating the class imbalance
problem for biological sequence analysis. We perform 3 distinct classification
tasks by using 3 different sequence datasets (Influenza A Virus, PALMdb, VDJdb)
and our results illustrate that GANs can improve the overall classification
performance.
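The contrast the abstract draws between SMOTE and GAN-based oversampling can be made concrete with a small sketch. The code below is not the authors' implementation: the function names `synthetic_needed` and `smote_like` are hypothetical, the feature vectors are toy values, and only the standard library is used. It shows how many synthetic samples each class would need to match the largest class, and how a SMOTE-style step builds each new point from one sample and a nearby neighbour, which is the "local information" the abstract refers to. In a GAN-based approach, that interpolation step would instead be replaced by sampling from a generator trained on the class as a whole.

```python
import random
from collections import Counter


def synthetic_needed(labels):
    """Return how many synthetic samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new points by interpolating between a minority sample and one of
    its k nearest minority neighbours (a simplified SMOTE-style local step).
    Assumes the minority class holds at least two samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        out.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return out


# Toy data (hypothetical): class "A" has 6 samples, class "B" only 2.
labels = ["A"] * 6 + ["B"] * 2
features_b = [[0.0, 1.0], [1.0, 3.0]]

deficit = synthetic_needed(labels)
new_b = smote_like(features_b, deficit["B"], k=1)
print(deficit)
print(new_b)
```

Every point in `new_b` lies on the line segment between the two real class "B" samples, so the synthetic data never leaves the immediate neighbourhood of existing points. That is the limitation a model of the overall class distribution is meant to address.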
Related papers
- Learning to refine domain knowledge for biological network inference [2.209921757303168]
Perturbation experiments allow biologists to discover causal relationships between variables of interest.
The sparsity and high dimensionality of these data pose significant challenges for causal structure learning algorithms.
We propose an amortized algorithm for refining domain knowledge, based on data observations.
arXiv Detail & Related papers (2024-10-18T12:53:23Z)
- Targeted Cause Discovery with Data-Driven Learning [66.86881771339145]
We propose a novel machine learning approach for inferring causal variables of a target variable from observations.
We employ a neural network trained to identify causality through supervised learning on simulated data.
Empirical results demonstrate the effectiveness of our method in identifying causal relationships within large-scale gene regulatory networks.
arXiv Detail & Related papers (2024-08-29T02:21:11Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- Criticality Analysis: Bio-inspired Nonlinear Data Representation [0.0]
Criticality Analysis (CA) is a bio-inspired method of information representation within a controlled self-organised critical system.
The input can be reduced dimensionally to a projection output that retains the features of the overall data, yet has a much simpler dynamic response.
The CA method allows for a biologically relevant encoding mechanism of arbitrary input to biosystems, creating a suitable model for information processing in organisms of varying complexity.
arXiv Detail & Related papers (2023-05-11T19:02:09Z)
- Unsupervised hierarchical clustering using the learning dynamics of RBMs [0.0]
We present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM).
Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems.
We tested our method on an artificially hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a family of proteins).
arXiv Detail & Related papers (2023-02-03T16:53:32Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Learning Neural Causal Models with Active Interventions [83.44636110899742]
We introduce an active intervention-targeting mechanism which enables a quick identification of the underlying causal structure of the data-generating process.
Our method significantly reduces the required number of interactions compared with random intervention targeting.
We demonstrate superior performance on multiple benchmarks from simulated to real-world data.
arXiv Detail & Related papers (2021-09-06T13:10:37Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.