Hilbert Curve Based Molecular Sequence Analysis
- URL: http://arxiv.org/abs/2412.20616v1
- Date: Sun, 29 Dec 2024 23:26:43 GMT
- Title: Hilbert Curve Based Molecular Sequence Analysis
- Authors: Sarwan Ali, Tamkanat E Ali, Imdad Ullah Khan, Murray Patterson,
- Abstract summary: We propose a universal Hibert curve-based Chaos Game Representation (CGR) method.<n>This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences.<n>The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of $94.5$% and an F1 score of $93.9%$ when tested with the CNN model on the lung cancer dataset.
- Score: 2.949890760187898
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed in the case of image-based data. To find a solution to this problem and to make Deep Learning (DL) models function to their maximum potential while capturing the important spatial information in the sequence data, we propose a universal Hibert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences. Our method can be globally applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of $94.5$\% and an F1 score of $93.9\%$ when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.
Related papers
- Sequence Analysis Using the Bezier Curve [3.9052860539161918]
We introduce a novel approach to transform sequences into images using the B'ezier curve concept for element mapping.
Mapping the elements onto a curve enhances the sequence information representation in the respective images.
arXiv Detail & Related papers (2025-03-18T15:40:46Z) - Learning-Order Autoregressive Models with Application to Molecular Graph Generation [52.44913282062524]
We introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data.
We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation.
arXiv Detail & Related papers (2025-03-07T23:24:24Z) - Stochastic gradient descent estimation of generalized matrix factorization models with application to single-cell RNA sequencing data [41.94295877935867]
Single-cell RNA sequencing allows the quantitation of gene expression at the individual cell level.<n> Dimensionality reduction is a common preprocessing step to simplify the visualization, clustering, and phenotypic characterization of samples.<n>We present a generalized matrix factorization model assuming a general exponential dispersion family distribution.<n>We propose a scalable adaptive descent algorithm that allows us to estimate the model efficiently.
arXiv Detail & Related papers (2024-12-29T16:02:15Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based underlineem Molecular underlineem Language underlineem Model, which randomly masking SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding [84.3224556294803]
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences.
We aim to optimize downstream reward functions while preserving the naturalness of these design spaces.
Our algorithm integrates soft value functions, which looks ahead to how intermediate noisy states lead to high rewards in the future.
arXiv Detail & Related papers (2024-08-15T16:47:59Z) - Conditionally-Conjugate Gaussian Process Factor Analysis for Spike Count Data via Data Augmentation [8.114880112033644]
Recently, GPFA has been extended to model spike count data.
We propose a conditionally-conjugate Gaussian process factor analysis (ccGPFA) resulting in both analytically and computationally tractable inference.
arXiv Detail & Related papers (2024-05-19T21:53:36Z) - A Universal Non-Parametric Approach For Improved Molecular Sequence
Analysis [4.588028371034407]
We present a novel approach based on the compression-based Model, motivated from citejiang2023low.
We compress the molecular sequence using well-known compression algorithms, such as Gzip and Bz2.
Next, we employ kernel Principal Component Analysis (PCA) to get the vector representations for the corresponding molecular sequence.
arXiv Detail & Related papers (2024-02-12T23:15:16Z) - Graph Polynomial Convolution Models for Node Classification of
Non-Homophilous Graphs [52.52570805621925]
We investigate efficient learning from higher-order graph convolution and learning directly from adjacency matrix for node classification.
We show that the resulting model lead to new graphs and residual scaling parameter.
We demonstrate that the proposed methods obtain improved accuracy for node-classification of non-homophilous parameters.
arXiv Detail & Related papers (2022-09-12T04:46:55Z) - Dynamic multi feature-class Gaussian process models [0.0]
This study presents a statistical modelling method for automatic learning of shape, pose and intensity features in medical images.
A DMFC-GPM is a Gaussian process (GP)-based model with a shared latent space that encodes linear and non-linear variation.
The model performance results suggest that this new modelling paradigm is robust, accurate, accessible, and has potential applications.
arXiv Detail & Related papers (2021-12-08T15:12:47Z) - Regularization of Mixture Models for Robust Principal Graph Learning [0.0]
A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of $D$-dimensional data points.
Parameters of the model are iteratively estimated through an Expectation-Maximization procedure.
arXiv Detail & Related papers (2021-06-16T18:00:02Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to emphNon-uniform Smoothness (NS) and emphNon-uniform Lojasiewicz inequality (NL)
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $Omega (1/t2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Improved guarantees and a multiple-descent curve for Column Subset
Selection and the Nystr\"om method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.