Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage
- URL: http://arxiv.org/abs/2207.04684v1
- Date: Mon, 11 Jul 2022 07:59:36 GMT
- Title: Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage
- Authors: Alan J.X. Guo, Cong Liang, Qing-Hu Hou
- Abstract summary: The Levenshtein distance is the most suitable metric for the similarity between two DNA sequences.
We propose a novel deep squared Euclidean embedding for DNA sequences using a Siamese neural network, squared Euclidean embedding, and chi-squared regression.
- Score: 4.447467536572626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Storing information in DNA molecules is of great interest because of its
advantages in longevity, high storage density, and low maintenance cost. A key
step in the DNA storage pipeline is to efficiently cluster the retrieved DNA
sequences according to their similarities. The Levenshtein distance is the
most suitable metric for the similarity between two DNA sequences, but it is
expensive to compute and poorly compatible with mature clustering algorithms.
In this work, we propose a novel deep squared Euclidean embedding for DNA
sequences using a Siamese neural network, squared Euclidean embedding, and
chi-squared regression. The Levenshtein distance is approximated by the
squared Euclidean distance between the embedding vectors, which is fast to
compute and friendly to clustering algorithms. The proposed approach is analyzed
theoretically and experimentally. The results show that the proposed embedding
is efficient and robust.
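A minimal, hypothetical sketch of the idea (not the authors' code): a shared encoder embeds two one-hot DNA sequences, and the squared Euclidean distance between the embeddings is regressed toward the true Levenshtein distance. The architecture, sequence length, and the MSE surrogate standing in for the paper's chi-squared regression loss are all assumptions.

```python
# Hedged sketch: Siamese encoder trained so that the squared Euclidean
# distance between embeddings matches the Levenshtein distance. An MSE
# surrogate replaces the paper's chi-squared regression likelihood.
import random
import torch
import torch.nn as nn

def levenshtein(a: str, b: str) -> int:
    """Classic O(|a||b|) dynamic program for the edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

BASES = "ACGT"

def one_hot(seq: str, length: int) -> torch.Tensor:
    x = torch.zeros(length, 4)
    for i, c in enumerate(seq[:length]):
        x[i, BASES.index(c)] = 1.0
    return x.flatten()

def random_pair(length=32, max_edits=8):
    """A random DNA strand and a copy corrupted by random edits."""
    s = "".join(random.choice(BASES) for _ in range(length))
    t = list(s)
    for _ in range(random.randint(0, max_edits)):
        op = random.choice("sid")  # substitution / insertion / deletion
        p = random.randrange(len(t)) if t else 0
        if op == "s" and t:
            t[p] = random.choice(BASES)
        elif op == "i":
            t.insert(p, random.choice(BASES))
        elif t:
            del t[p]
    return s, "".join(t)

length, emb_dim = 32, 16  # illustrative hyperparameters, not the paper's
encoder = nn.Sequential(nn.Linear(length * 4, 128), nn.ReLU(), nn.Linear(128, emb_dim))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(200):
    pairs = [random_pair(length) for _ in range(64)]
    xa = torch.stack([one_hot(a, length) for a, _ in pairs])
    xb = torch.stack([one_hot(b, length) for _, b in pairs])
    target = torch.tensor([float(levenshtein(a, b)) for a, b in pairs])
    sq_dist = ((encoder(xa) - encoder(xb)) ** 2).sum(dim=1)  # shared weights: Siamese
    loss = ((sq_dist - target) ** 2).mean()  # surrogate for chi-squared regression
    opt.zero_grad(); loss.backward(); opt.step()
```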
Related papers
- Fast unsupervised ground metric learning with tree-Wasserstein distance [14.235762519615175]
Unsupervised ground metric learning approaches have been introduced.
We propose to augment the WSV (Wasserstein singular vectors) method by embedding samples and features on trees, on which we compute the tree-Wasserstein distance (TWD).
We demonstrate theoretically and empirically that the algorithm converges to a better approximation of the full WSV approach than the best known alternatives, and does so with $\mathcal{O}(n^3)$ complexity.
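For intuition, the TWD itself is cheap to evaluate: between two distributions on a weighted tree it reduces to a weighted L1 norm of subtree mass differences. A minimal sketch, with a made-up toy tree:

```python
# Hedged sketch: TWD(mu, nu) = sum over edges of edge_weight * |difference of
# probability mass in the subtree below that edge|.
import numpy as np

def depth(parent, i):
    d = 0
    while parent[i] >= 0:
        i = parent[i]; d += 1
    return d

def tree_wasserstein(parent, weight, mu, nu):
    """parent[i] = parent of node i (root has parent -1); weight[i] = length
    of the edge (i, parent[i]); mu, nu = distributions over the nodes."""
    n = len(parent)
    subtree = np.asarray(mu, float) - np.asarray(nu, float)
    total = 0.0
    # Visit children before parents so subtree masses are complete when used.
    for i in sorted(range(n), key=lambda i: -depth(parent, i)):
        if parent[i] >= 0:
            total += weight[i] * abs(subtree[i])
            subtree[parent[i]] += subtree[i]
    return total

# Toy tree: node 0 is the root; 1, 2 hang off it; 3, 4 hang off node 1.
parent = [-1, 0, 0, 1, 1]
weight = [0.0, 1.0, 2.0, 1.0, 3.0]
mu = [0.0, 0.0, 0.5, 0.5, 0.0]
nu = [0.0, 0.0, 0.0, 0.0, 1.0]
print(tree_wasserstein(parent, weight, mu, nu))  # 5.0
```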
arXiv Detail & Related papers (2024-11-11T23:21:01Z)
- Levenshtein Distance Embedding with Poisson Regression for DNA Storage [8.943376293527114]
Sequence embedding maps the Levenshtein distance to a conventional distance between embedding vectors.
In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed.
We demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
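A hedged sketch of the Poisson-regression idea: treat the Levenshtein distance as a Poisson count whose rate is the squared distance between embeddings, and minimize the Poisson negative log-likelihood. The shapes and the rate parameterization are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: Poisson NLL in place of a squared-error regression.
import torch
import torch.nn as nn

emb_a = torch.randn(64, 16, requires_grad=True)  # stand-ins for encoder outputs
emb_b = torch.randn(64, 16, requires_grad=True)
target = torch.randint(0, 10, (64,)).float()     # stand-in Levenshtein distances

rate = ((emb_a - emb_b) ** 2).sum(dim=1) + 1e-6  # predicted Poisson rate
loss = nn.PoissonNLLLoss(log_input=False)(rate, target)
loss.backward()                                  # gradients reach both embeddings
print(float(loss))
```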
arXiv Detail & Related papers (2023-12-13T07:20:27Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points with a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
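One plausible reading of the Butterworth construction (an assumption on our part, not the paper's stated formula) is to pass pairwise distances through the Butterworth magnitude response, giving an affinity that decays smoothly past a cutoff:

```python
# Hedged sketch: Butterworth-style affinity 1 / (1 + (d / d0)^(2N)).
# The cutoff d0 and filter order N below are arbitrary choices.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

D = cdist(X, X)                        # pairwise Euclidean distances
d0, N = np.median(D), 4                # cutoff distance and filter order (assumed)
A = 1.0 / (1.0 + (D / d0) ** (2 * N))  # smooth low-pass affinity matrix
print(A.shape, A[0, 0])                # self-affinity is 1
```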
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
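In one dimension the linearization is exact and easy to demonstrate: embedding each empirical distribution by its quantile function turns the 2-Wasserstein distance into a Euclidean one, after which standard dimensionality reduction applies. A synthetic sketch (LOT Wassmap itself targets higher-dimensional measures):

```python
# Hedged sketch: quantile embedding linearizes W2 exactly in 1D; PCA on the
# embeddings then recovers the low-dimensional structure of the family.
import numpy as np

rng = np.random.default_rng(1)
levels = np.linspace(0.01, 0.99, 50)

# 30 one-dimensional point clouds: Gaussians sweeping a 1-parameter family.
clouds = [rng.normal(loc=m, scale=1.0, size=200) for m in np.linspace(-3, 3, 30)]

# W2 between two 1D clouds = L2 distance between their quantile vectors.
E = np.stack([np.quantile(c, levels) for c in clouds])

Ec = E - E.mean(axis=0)
_, _, Vt = np.linalg.svd(Ec, full_matrices=False)
coords = Ec @ Vt[:2].T
print(coords[:, 0])  # approximately monotone in the underlying mean
```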
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
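To see why random features can stand in for the NNGP kernel: for a one-hidden-layer ReLU network the NNGP kernel is the order-1 arc-cosine kernel, which a Monte Carlo feature map approximates. A minimal sketch (the feature count is an arbitrary choice):

```python
# Hedged sketch: random ReLU features approximate the arc-cosine kernel of
# order 1 (Cho & Saul, 2009), the NNGP kernel of a 1-hidden-layer ReLU net.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)

D = 200_000
W = rng.normal(size=(D, 3))
phi = lambda v: np.sqrt(2.0 / D) * np.maximum(W @ v, 0.0)  # random feature map
approx = phi(x) @ phi(y)

# Closed form of the order-1 arc-cosine kernel for comparison.
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
theta = np.arccos(np.clip(x @ y / (nx * ny), -1, 1))
exact = (nx * ny / np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
print(approx, exact)  # close for large D
```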
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the approximate kernel in order to enhance its predictive performance.
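As a concrete example of an approximate sequence kernel (the spectrum kernel here is a standard stand-in, not necessarily the paper's construction), a k-mer count feature map gives an explicit embedding whose inner product can feed a linear SVM:

```python
# Hedged sketch: k-mer spectrum feature map; the inner product counts shared
# k-mers between two protein sequences. k = 3 and the sequences are toys.
import itertools
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # amino acids, e.g. spike protein sequences

def kmer_features(seq, k=3, alphabet=ALPHABET):
    index = {"".join(p): i for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1.0
    return v

a = kmer_features("MKVLLACDEF")
b = kmer_features("MKVLLACDEY")
print(a @ b)  # spectrum-kernel value between the two sequences
```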
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
A density peaks clustering algorithm with sparse search and a K-d tree is developed to reduce the computational cost.
Experiments are carried out on datasets with different distribution characteristics, comparing with five other typical clustering algorithms.
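A hedged sketch of the underlying density-peaks recipe, with the density step accelerated by a k-d tree range query; the nearest-higher-density scan is left as the plain O(n^2) loop that the paper's sparse search targets:

```python
# Hedged sketch: density peaks = (local density rho, distance delta to the
# nearest point of higher density); large rho * delta marks center candidates.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])

tree = cKDTree(X)
rho = np.array([len(tree.query_ball_point(p, r=0.5)) for p in X])  # densities

D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
delta = np.empty(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = D[i, higher].min() if len(higher) else D[i].max()

centers = np.argsort(rho * delta)[-2:]  # the two most prominent peaks
print(X[centers])
```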
arXiv Detail & Related papers (2022-03-02T09:29:40Z)
- Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
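For intuition only (this is not the paper's diffusion construction), the contrast between the naive ball-indicator density and an adaptive kernel density can be sketched by letting the bandwidth follow the local k-NN distance:

```python
# Hedged sketch: naive unit-ball count vs. a Gaussian density whose
# per-point bandwidth adapts to the local scale (8th-neighbor distance).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(0, 0.2, 300), rng.normal(4, 1.5, 300)])[:, None]

tree = cKDTree(X)
naive = np.array([len(tree.query_ball_point(p, r=0.3)) for p in X])  # ball count

knn_dist = tree.query(X, k=8)[0][:, -1]       # local scale per point
D = np.abs(X - X.T)
adaptive = np.exp(-(D / knn_dist[None, :]) ** 2).sum(axis=1)  # adaptive density
print(naive[:3], adaptive[:3])
```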
arXiv Detail & Related papers (2021-10-11T09:00:33Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction of the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than other baseline feature map constructions to achieve comparable error bounds both in theory and practice.
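As background, the empirical NTK feature map of a network is the gradient of its output with respect to the parameters, so the kernel is an inner product of gradients. A sketch with the gradients of a toy one-hidden-layer ReLU net written out analytically (the architecture is an arbitrary choice, not the paper's construction):

```python
# Hedged sketch: empirical NTK(x, y) = <grad_theta f(x), grad_theta f(y)>
# for f(x) = a . relu(W x).
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 64                      # input dim, hidden width
W = rng.normal(size=(m, d)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)

def ntk_features(x):
    """Concatenated gradients of f(x) with respect to (W, a)."""
    pre = W @ x
    h = np.maximum(pre, 0.0)
    grad_a = h                                   # df/da
    grad_W = (a * (pre > 0)).reshape(-1, 1) * x  # df/dW, one row per hidden unit
    return np.concatenate([grad_W.ravel(), grad_a])

x, y = rng.normal(size=d), rng.normal(size=d)
print(ntk_features(x) @ ntk_features(y))  # empirical NTK value
```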
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Efficient approximation of DNA hybridisation using deep learning [0.0]
We present the first comprehensive study of machine learning methods applied to the task of predicting DNA hybridisation.
We introduce a synthetic hybridisation dataset of over 2.5 million data points, enabling the use of a wide range of machine learning algorithms.
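A hedged sketch of what such a learning setup can look like: one-hot encode both strands of a pair and fit a small classifier. The synthetic reverse-complement labels below stand in for the paper's 2.5M-point hybridisation dataset and are purely illustrative.

```python
# Hedged sketch: predict hybridisation-like labels from one-hot sequence pairs.
import numpy as np
from sklearn.neural_network import MLPClassifier

BASES = "ACGT"
COMP = str.maketrans("ACGT", "TGCA")
rng = np.random.default_rng(5)

def encode(s):
    return np.eye(4)[[BASES.index(c) for c in s]].ravel()

X, y = [], []
for _ in range(2000):
    s = "".join(rng.choice(list(BASES), 12))
    if rng.random() < 0.5:                          # positive: reverse complement
        t, label = s.translate(COMP)[::-1], 1
    else:                                           # negative: unrelated strand
        t, label = "".join(rng.choice(list(BASES), 12)), 0
    X.append(np.concatenate([encode(s), encode(t)]))
    y.append(label)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(np.array(X), np.array(y))
print(clf.score(np.array(X), np.array(y)))  # training accuracy
```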
arXiv Detail & Related papers (2021-02-19T19:23:49Z)