Levenshtein Distance Embedding with Poisson Regression for DNA Storage
- URL: http://arxiv.org/abs/2312.07931v1
- Date: Wed, 13 Dec 2023 07:20:27 GMT
- Title: Levenshtein Distance Embedding with Poisson Regression for DNA Storage
- Authors: Xiang Wei, Alan J.X. Guo, Sihan Sun, Mengyi Wei, Wei Yu
- Abstract summary: Sequence embedding maps Levenshtein distance to a conventional distance between embedding vectors.
In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed.
We demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
- Score: 8.943376293527114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient computation or approximation of Levenshtein distance, a widely
used metric for evaluating sequence similarity, has attracted significant attention
with the emergence of DNA storage and other biological applications. Sequence
embedding, which maps Levenshtein distance to a conventional distance between
embedding vectors, has emerged as a promising solution. In this paper, a novel
neural network-based sequence embedding technique using Poisson regression is
proposed. We first provide a theoretical analysis of the impact of embedding
dimension on model performance and present a criterion for selecting an
appropriate embedding dimension. Under this embedding dimension, the Poisson
regression is introduced by assuming that the Levenshtein distance between sequences
of fixed length follows a Poisson distribution, which naturally aligns with
the definition of Levenshtein distance. Moreover, from the perspective of the
distribution of embedding distances, Poisson regression approximates the
negative log-likelihood of the chi-squared distribution and helps remove its
skewness. Through comprehensive experiments on real DNA storage
data, we demonstrate the superior performance of the proposed method compared
to state-of-the-art approaches.
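As a minimal illustration of the two ingredients the abstract names, the sketch below computes Levenshtein distance by dynamic programming and a Poisson negative log-likelihood loss of the kind used to regress embedding distances onto edit distances. Function names and details are illustrative, not taken from the paper:

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance,
    using two rows of the DP table at a time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def poisson_nll(pred_rate: np.ndarray, target: np.ndarray) -> float:
    """Poisson negative log-likelihood (constant log(target!) term dropped).
    In a Poisson-regression embedding, pred_rate would be a distance between
    embedding vectors and target the true Levenshtein distance."""
    return float(np.mean(pred_rate - target * np.log(pred_rate + 1e-9)))

print(levenshtein("kitten", "sitting"))  # 3
```

Training an embedding network against `poisson_nll` (rather than, say, squared error) matches the count-valued nature of edit distance, which is the alignment the abstract appeals to.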
Related papers
- Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis [56.442307356162864]
We study the theoretical aspects of score-based discrete diffusion models under the Continuous Time Markov Chain (CTMC) framework.
We introduce a discrete-time sampling algorithm in the general state space $[S]^d$ that utilizes score estimators at predefined time points.
Our convergence analysis employs a Girsanov-based method and establishes key properties of the discrete score function.
arXiv Detail & Related papers (2024-10-03T09:07:13Z) - Distributed Markov Chain Monte Carlo Sampling based on the Alternating
Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z) - Sampling and estimation on manifolds using the Langevin diffusion [45.57801520690309]
Two estimators of linear functionals of $\mu_\phi$ based on the discretized Markov process are considered.
Error bounds are derived for sampling and estimation using a discretization of an intrinsically defined Langevin diffusion.
arXiv Detail & Related papers (2023-12-22T18:01:11Z) - Conformal inference for regression on Riemannian Manifolds [49.7719149179179]
We investigate prediction sets for regression scenarios when the response variable, denoted by $Y$, resides in a manifold, and the covariate, denoted by $X$, lies in Euclidean space.
We prove the almost sure convergence of the empirical version of these regions on the manifold to their population counterparts.
arXiv Detail & Related papers (2023-10-12T10:56:25Z) - Positive definite nonparametric regression using an evolutionary
algorithm with application to covariance function estimation [0.0]
We propose a novel nonparametric regression framework for estimating covariance functions of stationary processes.
Our method can impose positive definiteness, as well as isotropy and monotonicity, on the estimators.
Our method provides more reliable estimates for long-range dependence.
arXiv Detail & Related papers (2023-04-25T22:01:14Z) - Score Approximation, Estimation and Distribution Recovery of Diffusion
Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z) - Kernel-based off-policy estimation without overlap: Instance optimality
beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z) - Density Estimation with Autoregressive Bayesian Predictives [1.5771347525430772]
In the context of density estimation, the standard Bayesian approach is to target the posterior predictive.
We develop a novel parameterization of the bandwidth using an autoregressive neural network that maps the data into a latent space.
arXiv Detail & Related papers (2022-06-13T20:43:39Z) - Stationary Density Estimation of Itô Diffusions Using Deep Learning [6.8342505943533345]
We consider the density estimation problem associated with the stationary measure of ergodic Itô diffusions from a discrete-time series.
We employ deep neural networks to approximate the drift and diffusion terms of the SDE.
We establish the convergence of the proposed scheme under appropriate mathematical assumptions.
arXiv Detail & Related papers (2021-09-09T01:57:14Z) - Instance-Optimal Compressed Sensing via Posterior Sampling [101.43899352984774]
We show, for Gaussian measurements and any prior distribution on the signal, that the posterior sampling estimator achieves near-optimal recovery guarantees.
We implement the posterior sampling estimator for deep generative priors using Langevin dynamics, and empirically find that it produces accurate estimates with more diversity than MAP.
arXiv Detail & Related papers (2021-06-21T22:51:56Z) - Statistical Inference after Kernel Ridge Regression Imputation under
item nonresponse [0.76146285961466]
We consider a nonparametric approach to imputation using the kernel ridge regression technique and propose consistent variance estimation.
The proposed variance estimator is based on a linearization approach which employs the entropy method to estimate the density ratio.
arXiv Detail & Related papers (2021-01-29T20:46:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.