Multidimensional Scaling for Gene Sequence Data with Autoencoders
- URL: http://arxiv.org/abs/2104.09014v1
- Date: Mon, 19 Apr 2021 02:14:17 GMT
- Title: Multidimensional Scaling for Gene Sequence Data with Autoencoders
- Authors: Pulasthi Wickramasinghe, Geoffrey Fox
- Abstract summary: We present an autoencoder-based dimensional reduction model which can easily scale to datasets containing millions of gene sequences.
The proposed model is evaluated against DAMDS with a real-world fungi gene sequence dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multidimensional scaling of gene sequence data has long played a vital role
in identifying clusters and patterns in that data. However, the computational
complexity and memory requirements of state-of-the-art dimensional scaling
algorithms make them infeasible to apply to large datasets.
In this paper we present an autoencoder-based dimensional reduction model which
can easily scale to datasets containing millions of gene sequences, while
attaining results comparable to state-of-the-art MDS algorithms with minimal
resource requirements. The model also supports out-of-sample data points with a
99.5%+ accuracy based on our experiments. The proposed model is evaluated
against DAMDS with a real-world fungi gene sequence dataset. The presented
results showcase the effectiveness of the autoencoder-based dimension reduction
model and its advantages.
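
For orientation, MDS-style methods are commonly judged by how well distances in the low-dimensional embedding match the original pairwise dissimilarities, often summarized as a "stress" value. The following is a minimal pure-Python sketch of that criterion only; it is not the paper's autoencoder model or DAMDS, and the function names `pairwise_distances` and `raw_stress` and the toy values are hypothetical, chosen for illustration.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def raw_stress(target, embedded):
    """Sum of squared differences between target and embedded distances (i < j)."""
    n = len(target)
    return sum(
        (target[i][j] - embedded[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy dissimilarity matrix for three items (assumed values, not real sequence data).
target = [[0.0, 3.0, 4.0],
          [3.0, 0.0, 5.0],
          [4.0, 5.0, 0.0]]

# A 2-D embedding that reproduces these distances (a 3-4-5 right triangle).
good = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# A poor embedding that collapses the first two items onto one point.
bad = [(0.0, 0.0), (0.0, 0.0), (0.0, 4.0)]

print(raw_stress(target, pairwise_distances(good)))
print(raw_stress(target, pairwise_distances(bad)))
```

A lower value means the embedding preserves the original dissimilarities more faithfully. Building the full pairwise matrix takes memory quadratic in the number of items, which is one reason such computations become costly at the scale of millions of sequences.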
Related papers
- Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions.
A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z)
- An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification [2.2940141855172036]
In molecular biology, there has been an explosion of data generated from multi-omics sequencing.
Traditional statistical methods face challenging tasks when dealing with such high dimensional data.
This study focuses on tackling these challenges with a neural network that incorporates an autoencoder to extract the latent space of the features.
arXiv Detail & Related papers (2024-05-16T01:45:55Z)
- A Bayesian Gaussian Process-Based Latent Discriminative Generative Decoder (LDGD) Model for High-Dimensional Data [0.41942958779358674]
The latent discriminative generative decoder (LDGD) employs both the data and associated labels in the manifold discovery process.
We show that LDGD can robustly infer the manifold and precisely predict labels in scenarios where data size is limited.
arXiv Detail & Related papers (2024-01-29T19:11:03Z)
- Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z)
- Optimizations of Autoencoders for Analysis and Classification of Microscopic In Situ Hybridization Images [68.8204255655161]
We propose a deep-learning framework to detect and classify areas of microscopic images with similar levels of gene expression.
The data we analyze requires an unsupervised learning model for which we employ a type of Artificial Neural Network - Deep Learning Autoencoders.
arXiv Detail & Related papers (2023-04-19T13:45:28Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data for Cancer Type Classification [4.992154875028543]
Integration and analysis of multi-omics data give us a broad view of tumours, which can improve clinical decision making.
SubOmiEmbed produces comparable results to the baseline OmiEmbed with a much smaller network and by using just a subset of the data.
This work can be improved to integrate mutation-based genomic data as well.
arXiv Detail & Related papers (2022-02-03T16:39:09Z)
- Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
arXiv Detail & Related papers (2021-08-27T18:10:31Z)
- Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks [0.0]
Gene expression datasets consist of thousands of genes with relatively small sample sizes.
The Glasso algorithm has been proposed to deal with high-dimensional micro-array datasets by forcing sparsity.
Modifications of the default Glasso algorithm are developed to overcome the problem of complex interaction structure.
arXiv Detail & Related papers (2021-06-28T15:04:35Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)