Related papers: A Simple Method for PMF Estimation on Large Supports

URL: http://arxiv.org/abs/2510.15132v1
Date: Thu, 16 Oct 2025 20:47:40 GMT
Title: A Simple Method for PMF Estimation on Large Supports
Authors: Alex Shtoff
Abstract summary: We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. Projecting the empirical PMF onto this low-dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. The method is short to implement, robust across sample sizes, and suitable for automated pipelines and exploratory analysis at scale because of its reliability and speed.
Score: 0.7163391346004578
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. The core idea is to treat the empirical PMF as a signal on a line graph and apply a data-dependent low-pass filter. Concretely, we form a symmetric tridiagonal operator, the path graph Laplacian perturbed with a diagonal matrix built from the empirical PMF, then compute the eigenvectors corresponding to the smallest few eigenvalues. Projecting the empirical PMF onto this low-dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. A light post-processing step of clipping and re-normalizing yields a valid PMF. Because we compute the eigenpairs of a symmetric tridiagonal matrix, the computation is reliable and runs in time and memory proportional to the support size times the dimension of the desired low-dimensional subspace. We also provide a practical, data-driven rule for selecting the dimension based on an orthogonal-series risk estimate, so the method "just works" with minimal tuning. On synthetic and real heavy-tailed examples, the approach preserves coarse structure while suppressing sampling noise and compares favorably to logspline and Gaussian-KDE baselines in the intended regimes. However, it has known failure modes (e.g., abrupt discontinuities). The method is short to implement, robust across sample sizes, and suitable for automated pipelines and exploratory analysis at scale because of its reliability and speed.
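The pipeline described in the abstract (tridiagonal operator, smallest few eigenvectors, projection, clip and re-normalize) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how the diagonal perturbation is built from the empirical PMF, so the `potential` term and its `strength` parameter are hypothetical placeholders, and the subspace dimension `k` is passed in by hand rather than chosen by the orthogonal-series risk rule. The function name `smooth_pmf` is also made up for this sketch.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal


def smooth_pmf(counts, k, strength=1.0):
    """Project an empirical PMF onto the k lowest eigenvectors of a
    perturbed path-graph Laplacian, then clip and re-normalize."""
    counts = np.asarray(counts, dtype=float)
    n = counts.size
    p_hat = counts / counts.sum()  # empirical PMF

    # Path-graph Laplacian: degree 1 at the two endpoints, 2 elsewhere,
    # and -1 on the off-diagonals.
    lap_diag = np.full(n, 2.0)
    lap_diag[0] = lap_diag[-1] = 1.0
    off_diag = np.full(n - 1, -1.0)

    # ASSUMPTION: the exact diagonal perturbation is not given in the
    # abstract. This placeholder is small where the empirical PMF is
    # large, so low eigenvectors can concentrate on high-mass regions.
    potential = strength * (1.0 - p_hat / p_hat.max())

    # Eigenvectors for the k smallest eigenvalues (indices 0..k-1).
    _, vecs = eigh_tridiagonal(
        lap_diag + potential, off_diag, select="i", select_range=(0, k - 1)
    )

    # Orthogonal projection onto the k-dimensional subspace.
    projected = vecs @ (vecs.T @ p_hat)

    # Post-processing: clip negatives and re-normalize to a valid PMF.
    estimate = np.clip(projected, 0.0, None)
    return estimate / estimate.sum()


rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0 * np.exp(-np.arange(200) / 40.0)) + 1
estimate = smooth_pmf(counts, k=8)
```

The eigenvector matrix has shape (support size, `k`), which is where the memory cost proportional to support size times subspace dimension comes from.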

Related papers

Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization [13.378211527081582]
We introduce a tuning-free framework to recover sparse signals in multiple measurement vectors. We show that the optimization dynamics exhibit a "momentum-like" effect, causing the norms of rows in the true support to grow significantly faster than others.
arXiv Detail & Related papers (2025-12-03T02:53:11Z)
Provable Non-Convex Euclidean Distance Matrix Completion: Geometry, Reconstruction, and Robustness [8.113729514518495]
The Euclidean Distance Matrix Completion problem arises in a broad range of applications, including sensor network localization, molecular robustness, and manifold learning. In this paper, we propose a low-rank matrix completion task over the space of positive semi-definite Gram matrices. The available distance measurements are encoded as expansion coefficients in a non-orthogonal basis, and optimization over the Gram matrix implicitly enforces geometric consistency through nonnegativity and the triangle inequality.
arXiv Detail & Related papers (2025-07-31T18:40:42Z)
Euclidean Distance Matrix Completion via Asymmetric Projected Gradient Descent [25.846262685970164]
This paper proposes and analyzes a gradient-type algorithm based on Burer-Monteiro factorization. It reconstructs the point set configuration from partial Euclidean distance measurements.
arXiv Detail & Related papers (2025-04-28T07:13:23Z)
A Bayesian Approach Toward Robust Multidimensional Ellipsoid-Specific Fitting [0.0]
This work presents a novel and effective method for fitting multidimensional ellipsoids to scattered data contaminated by noise and outliers. We incorporate a uniform prior distribution to constrain the search for primitive parameters within an ellipsoidal domain. We apply it to a wide range of practical applications such as microscopy cell counting, 3D reconstruction, geometric shape approximation, and magnetometer calibration tasks.
arXiv Detail & Related papers (2024-07-27T14:31:51Z)
Large-scale gradient-based training of Mixtures of Factor Analyzers [67.21722742907981]
This article contributes both a theoretical analysis as well as a new method for efficient high-dimensional training by gradient descent. We prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed. Besides the theoretical analysis and matrices, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.
arXiv Detail & Related papers (2023-08-26T06:12:33Z)
Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream. The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior matrix. In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates. The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
Efficient CDF Approximations for Normalizing Flows [64.60846767084877]
We build upon the diffeomorphic properties of normalizing flows to estimate the cumulative distribution function (CDF) over a closed region. Our experiments on popular flow architectures and UCI datasets show a marked improvement in sample efficiency as compared to traditional estimators.
arXiv Detail & Related papers (2022-02-23T06:11:49Z)
Robust Principal Component Analysis: A Median of Means Approach [17.446104539598895]
Principal Component Analysis is a tool for data visualization, denoising, and dimensionality reduction. Recent supervised learning methods have shown great success in dealing with outlying observations. This paper proposes a PCA procedure based on the MoM principle.
arXiv Detail & Related papers (2021-02-05T19:59:05Z)
Stochastic Approximation for Online Tensorial Independent Component Analysis [98.34292831923335]
Independent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing. In this paper, we present a by-product online tensorial algorithm that estimates for each independent component.
arXiv Detail & Related papers (2020-12-28T18:52:37Z)
Independent finite approximations for Bayesian nonparametric inference [30.367795444044788]
We propose a recipe to construct practical finite-dimensional approximations for homogeneous random measures. We upper bound the approximation error of AIFAs for a wide class of common CRMs and NCRMs. We prove that, for worst-case choices of observation likelihoods, TFAs are more efficient than AIFAs.
arXiv Detail & Related papers (2020-09-22T19:37:21Z)
Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model. We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.