Mixture models for data with unknown distributions
- URL: http://arxiv.org/abs/2502.19605v1
- Date: Wed, 26 Feb 2025 22:42:40 GMT
- Title: Mixture models for data with unknown distributions
- Authors: M. E. J. Newman
- Abstract summary: We describe and analyze a broad class of mixture models for real-valued multivariate data. We return both a division of the data and an estimate of the distributions, effectively performing clustering and density estimation within each cluster at the same time. We demonstrate our methods with a selection of illustrative applications and give code implementing both algorithms.
- Score: 0.6345523830122168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe and analyze a broad class of mixture models for real-valued multivariate data in which the probability density of observations within each component of the model is represented as an arbitrary combination of basis functions. Fits to these models give us a way to cluster data with distributions of unknown form, including strongly non-Gaussian or multimodal distributions, and return both a division of the data and an estimate of the distributions, effectively performing clustering and density estimation within each cluster at the same time. We describe two fitting methods, one using an expectation-maximization (EM) algorithm and the other a Bayesian non-parametric method using a collapsed Gibbs sampler. The former is numerically efficient, but gives only point estimates of the probability densities. The latter is more computationally demanding but returns a full Bayesian posterior and also an estimate of the number of components. We demonstrate our methods with a selection of illustrative applications and give code implementing both algorithms.
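As a rough illustration of the EM idea described in the abstract, here is a minimal sketch. It is not the author's code. It assumes the simplest possible basis (piecewise-constant histogram bins) and assumes that, within each component, the dimensions are independent; neither choice is stated in the abstract. The function name `fit_histogram_mixture` and all of its parameters are hypothetical.

```python
import numpy as np


def fit_histogram_mixture(x, n_components=2, n_bins=10, n_iter=200, seed=0):
    """EM for a mixture whose component densities are piecewise constant.

    Assumptions of this sketch (not taken from the paper): histogram bins
    act as the basis functions, and dimensions are independent within a
    component. Each dimension of x must take more than one distinct value.

    x has shape (n, d). Returns (weights, theta, resp, edges), where
    theta[k, m, j] is the probability mass component k assigns to bin j
    of dimension m, and resp[i, k] is the responsibility of k for point i.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, d = x.shape

    # Equal-width bin edges per dimension, spanning the observed range.
    edges = [np.linspace(x[:, m].min(), x[:, m].max(), n_bins + 1)
             for m in range(d)]
    # Bin index of every point in every dimension, shape (n, d).
    bins = np.stack(
        [np.clip(np.searchsorted(edges[m], x[:, m], side="right") - 1,
                 0, n_bins - 1) for m in range(d)],
        axis=1,
    )
    widths = np.stack([np.diff(e) for e in edges])  # shape (d, n_bins)

    # Random soft assignment as the starting point.
    resp = rng.dirichlet(np.ones(n_components), size=n)
    eps = 1e-6  # smoothing so that no bin has exactly zero mass

    for _ in range(n_iter):
        # M-step: mixing weights and per-bin masses from responsibilities.
        weights = resp.mean(axis=0)
        theta = np.full((n_components, d, n_bins), eps)
        for m in range(d):
            for k in range(n_components):
                theta[k, m] += np.bincount(
                    bins[:, m], weights=resp[:, k], minlength=n_bins)
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: responsibilities from the current densities (log space).
        log_p = np.tile(np.log(np.maximum(weights, 1e-300)), (n, 1))
        for m in range(d):
            dens = theta[:, m, :] / widths[m]        # (K, n_bins) densities
            log_p += np.log(dens[:, bins[:, m]]).T   # (n, K)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, theta, resp, edges


if __name__ == "__main__":
    # Synthetic demo: two well-separated 2D clusters.
    demo_rng = np.random.default_rng(1)
    data = np.vstack([demo_rng.normal(0.0, 1.0, (200, 2)),
                      demo_rng.normal(6.0, 1.0, (200, 2))])
    w, theta, resp, _ = fit_histogram_mixture(data, n_components=2)
    print("mixing weights:", w)
    print("hard cluster sizes:", np.bincount(resp.argmax(axis=1), minlength=2))
```

The returned `resp` gives a soft division of the data, and `theta` divided by the bin widths gives a point estimate of each component's density, which mirrors the "clustering and density estimation at the same time" framing above. Like any EM fit, this sketch can converge to a local optimum, so results may depend on `seed`.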
Related papers
- Cluster weighted models with multivariate skewed distributions for functional data [0.0]
We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions.
Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method.
We illustrate the performance of funWeightClustSkew for simulated data and for the Air Quality dataset.
arXiv Detail & Related papers (2025-04-17T06:17:06Z)
- Fusion of Gaussian Processes Predictions with Monte Carlo Sampling [61.31380086717422]
In science and engineering, we often work with models designed for accurate prediction of variables of interest.
Recognizing that these models are approximations of reality, it becomes desirable to apply multiple models to the same data and integrate their outcomes.
arXiv Detail & Related papers (2024-03-03T04:21:21Z)
- Empirical Density Estimation based on Spline Quasi-Interpolation with applications to Copulas clustering modeling [0.0]
Density estimation is a fundamental technique employed in various fields to model and to understand the underlying distribution of data.
In this paper we propose the mono-variate approximation of the density using quasi-interpolation.
The presented algorithm is validated on artificial and real datasets.
arXiv Detail & Related papers (2024-02-18T11:49:38Z)
- PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation [8.527898482146103]
We propose a comprehensive sample-based method for assessing the quality of generative models.
The proposed approach enables the estimation of the probability that two sets of samples are drawn from the same distribution.
arXiv Detail & Related papers (2024-02-06T19:39:26Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
- Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z)
- A similarity-based Bayesian mixture-of-experts model [0.5156484100374058]
We present a new non-parametric mixture-of-experts model for multivariate regression problems.
Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point.
Posterior inference is performed on the parameters of the mixture as well as the distance metric.
arXiv Detail & Related papers (2020-12-03T18:08:30Z)
- Kernel learning approaches for summarising and combining posterior similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices.
arXiv Detail & Related papers (2020-09-27T14:16:14Z)
- Model Fusion with Kullback--Leibler Divergence [58.20269014662046]
We propose a method to fuse posterior distributions learned from heterogeneous datasets.
Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors.
arXiv Detail & Related papers (2020-07-13T03:27:45Z)
- Handling missing data in model-based clustering [0.0]
We propose two methods to fit Gaussian mixtures in the presence of missing data.
Both methods use a variant of the Monte Carlo Expectation-Maximisation algorithm for data augmentation.
We show that the proposed methods outperform the multiple imputation approach, both in terms of clusters identification and density estimation.
arXiv Detail & Related papers (2020-06-04T15:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.