A Universal Non-Parametric Approach For Improved Molecular Sequence
Analysis
- URL: http://arxiv.org/abs/2402.08117v1
- Date: Mon, 12 Feb 2024 23:15:16 GMT
- Title: A Universal Non-Parametric Approach For Improved Molecular Sequence
Analysis
- Authors: Sarwan Ali, Tamkanat E Ali, Prakash Chourasia, Murray Patterson
- Abstract summary: We present a novel compression-based approach, motivated by \cite{jiang2023low}.
We compress the molecular sequence using well-known compression algorithms, such as Gzip and Bz2.
Next, we employ kernel Principal Component Analysis (PCA) to obtain vector representations for the corresponding molecular sequences.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In the field of biological research, it is essential to understand the
characteristics and functions of molecular sequences. Neural network-based
techniques have seen widespread use in the classification of molecular
sequences. Despite their impressive accuracy, these models often require a
substantial number of parameters and large amounts of training data. In this
work, we present a novel approach based on a compression-based model, motivated
by \cite{jiang2023low}, which combines the simplicity of basic compression
algorithms such as Gzip and Bz2 with the Normalized Compression Distance (NCD)
algorithm to achieve better performance on classification tasks without relying
on handcrafted features or pre-trained models. First, we compress each
molecular sequence using well-known compression algorithms, such as Gzip and
Bz2. By leveraging the latent structure encoded in the compressed files, we
compute the Normalized Compression Distance, derived from Kolmogorov
complexity, between each pair of molecular sequences. This yields a distance
matrix, which serves as the input for generating a kernel matrix using a
Gaussian kernel. Next, we apply kernel Principal Component Analysis (PCA) to
obtain a vector representation for each molecular sequence, capturing important
structural and functional information. The resulting vector representations
provide an efficient yet effective solution for molecular sequence analysis and
can be used in ML-based downstream tasks. The proposed approach eliminates the
need for computationally intensive Deep Neural Networks (DNNs), with their
large parameter counts and data requirements. Instead, it leverages a
lightweight and universally accessible compression-based model.
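The pipeline described in the abstract (compress, pairwise NCD, Gaussian kernel, kernel PCA) can be sketched as follows. This is a minimal illustration assuming gzip as the compressor and toy DNA sequences; the sequence data, bandwidth choice, and number of components are hypothetical, not the paper's actual settings.

```python
import gzip

import numpy as np


def c(x: bytes) -> int:
    """Compressed length of x; stands in for the (uncomputable) Kolmogorov complexity."""
    return len(gzip.compress(x))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy molecular sequences (hypothetical stand-ins for real data)
seqs = [
    b"ATGCATGCATGCATGC",
    b"ATGCATGGATGCATGC",
    b"TTTTGGGGCCCCAAAA",
    b"TTTTGGGGCCCCAAAT",
]
n = len(seqs)

# Pairwise NCD distance matrix; NCD is only approximately symmetric, so symmetrize
D = np.array([[ncd(a, b) for b in seqs] for a in seqs])
D = (D + D.T) / 2

# Gaussian kernel from the distance matrix (median heuristic for the bandwidth)
sigma = np.median(D[D > 0])
K = np.exp(-(D**2) / (2 * sigma**2))

# Kernel PCA: double-center the kernel matrix, then eigendecompose
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J
vals, vecs = np.linalg.eigh(Kc)          # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

k = 2  # number of components (illustrative choice)
emb = vecs[:, :k] * np.sqrt(np.clip(vals[:k], 0, None))  # n x k vector representations
```

The resulting `emb` rows are the per-sequence vectors that would feed ML-based downstream tasks; swapping `gzip.compress` for `bz2.compress` changes only the compressor.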
Related papers
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z) - HD-Bind: Encoding of Molecular Structure with Low Precision,
Hyperdimensional Binary Representations [3.3934198248179026]
Hyperdimensional Computing (HDC) is a learning paradigm that leverages low-precision binary vector arithmetic.
We show that HDC-based inference methods are as much as 90 times more efficient than more complex representative machine learning methods.
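The binary vector arithmetic the blurb alludes to rests on a few HDC primitives: binding (XOR), bundling (majority vote), and Hamming similarity. The sketch below is a generic, hypothetical toy encoding of short sequences, not the HD-Bind method itself.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000  # high dimensionality makes random binary vectors quasi-orthogonal

# Random binary hypervectors for a toy symbol alphabet and for positions (hypothetical)
symbols = {s: rng.integers(0, 2, DIM, dtype=np.uint8) for s in "ACGT"}
MAXLEN = 8
positions = [rng.integers(0, 2, DIM, dtype=np.uint8) for _ in range(MAXLEN)]


def bind(a, b):
    """XOR binding: invertible, and the result is dissimilar to both inputs."""
    return a ^ b


def bundle(vectors):
    """Majority-vote bundling: the result stays similar to every input."""
    return (np.sum(vectors, axis=0) * 2 >= len(vectors)).astype(np.uint8)


def hamming_sim(a, b):
    """Normalized similarity in [0, 1]; ~0.5 for unrelated random hypervectors."""
    return 1.0 - float(np.mean(a != b))


def encode(seq):
    """Bind each symbol to its position vector, then bundle the whole sequence."""
    return bundle([bind(symbols[s], positions[i]) for i, s in enumerate(seq)])


a = encode("ACGTACGT")
b = encode("ACGTACGA")  # differs from a in one position
c = encode("TTTTGGGG")  # unrelated sequence
```

All operations are elementwise on binary vectors, which is why inference under this paradigm can be far cheaper than floating-point model evaluation.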
arXiv Detail & Related papers (2023-03-27T21:21:46Z) - Multiresolution Graph Transformers and Wavelet Positional Encoding for
Learning Hierarchical Structures [6.875312133832078]
We propose Multiresolution Graph Transformers (MGT), the first graph transformer architecture that can learn to represent large molecules at multiple scales.
MGT can learn to produce representations for the atoms and group them into meaningful functional groups or repeating units.
Our proposed model is evaluated on two macromolecule datasets, consisting of polymers and peptides, and one drug-like molecule dataset.
arXiv Detail & Related papers (2023-02-17T01:32:44Z) - Linear-scaling kernels for protein sequences and small molecules
outperform deep learning while providing uncertainty quantitation and
improved interpretability [5.623232537411766]
We develop efficient and scalable approaches for fitting GP models and fast convolution kernels.
We implement these improvements by building an open-source Python library called xGPR.
We show that xGPR generally outperforms convolutional neural networks on predicting key properties of proteins and small molecules.
arXiv Detail & Related papers (2023-02-07T07:06:02Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
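Random feature approximation replaces an exact n-by-n kernel matrix with an inner product of finite-dimensional feature maps. The sketch below illustrates the mechanism on a standard RBF kernel via random Fourier features; the paper applies the idea to the NNGP kernel, and the sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 5, 2000   # samples, input dim, number of random features (illustrative)
gamma = 0.5              # RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2)

X = rng.normal(size=(n, d))

# Exact RBF kernel matrix (O(n^2 d) to form, O(n^2) to store)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-gamma * sq)

# Random Fourier features: z(x) = sqrt(2/m) * cos(W^T x + b),
# with W ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi)
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))
b = rng.uniform(0, 2 * np.pi, m)
Z = np.sqrt(2.0 / m) * np.cos(X @ W + b)

# Z @ Z.T approximates K_exact, so downstream kernel methods become linear in n
K_approx = Z @ Z.T
mean_err = float(np.abs(K_exact - K_approx).mean())
```

The approximation error shrinks as `m` grows, which is the trade-off such methods exploit to avoid forming the exact kernel matrix on large datasets.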
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z) - On minimizers and convolutional filters: theoretical connections and
applications to genome analysis [2.8282906214258805]
CNNs start with a wide array of random convolutional filters, paired with a pooling operation, followed by multiple additional neural layers that learn both the filters themselves and how to use them to classify the sequence.
In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres.
We train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins.
arXiv Detail & Related papers (2021-11-09T19:02:04Z) - Even more efficient quantum computations of chemistry through tensor
hypercontraction [0.6234350105794442]
We describe quantum circuits with only $\widetilde{\mathcal{O}}(N)$ Toffoli complexity that block encode the spectra of quantum chemistry Hamiltonians in a basis of $N$ arbitrary orbitals.
This is the lowest complexity that has been shown for quantum computations of chemistry within an arbitrary basis.
arXiv Detail & Related papers (2020-11-06T18:03:29Z) - Connecting Weighted Automata, Tensor Networks and Recurrent Neural
Networks through Spectral Learning [58.14930566993063]
We present connections between three models used in different research fields: weighted finite automata (WFA) from formal languages and linguistics, recurrent neural networks from machine learning, and tensor networks.
We introduce the first provable learning algorithm for linear 2-RNNs defined over sequences of continuous input vectors.
arXiv Detail & Related papers (2020-10-19T15:28:00Z) - MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization [51.00815310242277]
Generative models and reinforcement learning approaches have had initial success, but still face difficulties in simultaneously optimizing multiple drug properties.
We propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework that uses the input molecule as an initial guess and samples molecules from the target distribution.
arXiv Detail & Related papers (2020-10-05T20:18:42Z) - Multipole Graph Neural Operator for Parametric Partial Differential
Equations [57.90284928158383]
One of the main challenges in using deep learning-based methods for simulating physical systems is formulating physics-based data.
We propose a novel multi-level graph neural network framework that captures interaction at all ranges with only linear complexity.
Experiments confirm our multi-graph network learns discretization-invariant solution operators to PDEs and can be evaluated in linear time.
arXiv Detail & Related papers (2020-06-16T21:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.