Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
- URL: http://arxiv.org/abs/2412.14717v1
- Date: Thu, 19 Dec 2024 10:31:25 GMT
- Title: Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
- Authors: Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson,
- Abstract summary: In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design.<n>This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings.
- Score: 3.9146761527401424
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final kernel matrix that satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of the rows and columns match the desired marginal distributions. We provided a comprehensive empirical analysis of the proposed kernel method to evaluate its goodness with greater depth. The suggested method is assessed for drug subcategory prediction (classification task) and solubility AlogPS ``Aqueous solubility and Octanol/Water partition coefficient" (regression task) using the benchmark SMILES string dataset. The outcomes show the proposed method outperforms several baseline methods in terms of supervised analysis and has potential uses in molecular design and drug discovery. Overall, the suggested method is a promising avenue for kernel methods-based molecular structure analysis and design.
Related papers
- Toward Efficient Kernel-Based Solvers for Nonlinear PDEs [19.975293084297014]
This paper introduces a novel kernel learning framework toward efficiently solving nonlinear partial differential equations (PDEs)
In contrast to the state-of-the-art kernel solver that embeds differential operators within kernels, our approach eliminates these operators from the kernel.
We model the solution using a standard kernel form and differentiate the interpolant to compute the derivatives.
arXiv Detail & Related papers (2024-10-15T01:00:43Z) - A Universal Non-Parametric Approach For Improved Molecular Sequence
Analysis [4.588028371034407]
We present a novel approach based on the compression-based Model, motivated from citejiang2023low.
We compress the molecular sequence using well-known compression algorithms, such as Gzip and Bz2.
Next, we employ kernel Principal Component Analysis (PCA) to get the vector representations for the corresponding molecular sequence.
arXiv Detail & Related papers (2024-02-12T23:15:16Z) - Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement
Learning [53.445068584013896]
We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure.
In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP.
We show that simple spectral-based matrix estimation approaches efficiently recover the singular subspaces of the matrix and exhibit nearly-minimal entry-wise error.
arXiv Detail & Related papers (2023-10-10T17:06:41Z) - Deep Unrolling for Nonconvex Robust Principal Component Analysis [75.32013242448151]
We design algorithms for Robust Component Analysis (A)
It consists in decomposing a matrix into the sum of a low Principaled matrix and a sparse Principaled matrix.
arXiv Detail & Related papers (2023-07-12T03:48:26Z) - BioSequence2Vec: Efficient Embedding Generation For Biological Sequences [1.0896567381206714]
We propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges.
Our proposed fast and alignment-free embedding method can be used as input to any distance.
We perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.
arXiv Detail & Related papers (2023-04-01T10:58:21Z) - Tree-based Implementation of the Small Matrix Path Integral for
System-Bath Dynamics [0.0]
The t-SMatPI algorithm is shown to be much faster than straightforward computation of the kernel matrices based on their definitions.
Our method may indicate some new properties of open quantum systems, and has the potential to be generalized to higher-order numerical schemes.
arXiv Detail & Related papers (2022-07-24T22:21:10Z) - Adaptive Explicit Kernel Minkowski Weighted K-means [1.3535770763481905]
The kernel K-means, which extends K-means into the kernel space, is able to capture nonlinear structures and identify arbitrarily shaped clusters.
This paper proposes a method to combine the advantages of the linear and nonlinear approaches by using driven corresponding approximate finite-dimensional feature maps.
arXiv Detail & Related papers (2020-12-04T19:14:09Z) - Kernel learning approaches for summarising and combining posterior
similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices.
arXiv Detail & Related papers (2020-09-27T14:16:14Z) - SimpleMKKM: Simple Multiple Kernel K-means [49.500663154085586]
We propose a simple yet effective multiple kernel clustering algorithm, termed simple multiple kernel k-means (SimpleMKKM)
Our criterion is given by an intractable minimization-maximization problem in the kernel coefficient and clustering partition matrix.
We theoretically analyze the performance of SimpleMKKM in terms of its clustering generalization error.
arXiv Detail & Related papers (2020-05-11T10:06:40Z) - Augmentation of the Reconstruction Performance of Fuzzy C-Means with an
Optimized Fuzzification Factor Vector [99.19847674810079]
Fuzzy C-Means (FCM) is one of the most frequently used methods to construct information granules.
In this paper, we augment the FCM-based degranulation mechanism by introducing a vector of fuzzification factors.
Experiments completed for both synthetic and publicly available datasets show that the proposed approach outperforms the generic data reconstruction approach.
arXiv Detail & Related papers (2020-04-13T04:17:30Z) - Improved guarantees and a multiple-descent curve for Column Subset
Selection and the Nystr\"om method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.