Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data
- URL: http://arxiv.org/abs/2202.13312v1
- Date: Sun, 27 Feb 2022 08:51:50 GMT
- Title: Sampling in Dirichlet Process Mixture Models for Clustering Streaming Data
- Authors: Or Dinari and Oren Freifeld
- Abstract summary: The Dirichlet Process Mixture Model (DPMM) seems a natural choice for the streaming-data case.
Existing methods for online DPMM inference are too slow to handle rapid data streams.
We propose adapting both the DPMM and a known DPMM sampling-based non-streaming inference method for streaming-data clustering.
- Score: 5.660207256468972
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Practical tools for clustering streaming data must be fast enough to handle
the arrival rate of the observations. Typically, they also must adapt on the
fly to possible lack of stationarity; i.e., the data statistics may be
time-dependent due to various forms of drifts, changes in the number of
clusters, etc. The Dirichlet Process Mixture Model (DPMM), whose Bayesian
nonparametric nature allows it to adapt its complexity to the data, seems a
natural choice for the streaming-data case. In its classical formulation,
however, the DPMM cannot capture common types of drifts in the data statistics.
Moreover, and regardless of that limitation, existing methods for online DPMM
inference are too slow to handle rapid data streams. In this work we propose
adapting both the DPMM and a known DPMM sampling-based non-streaming inference
method for streaming-data clustering. We demonstrate the utility of the
proposed method on several challenging settings, where it obtains
state-of-the-art results while being on par with other methods in terms of
speed.
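To make the sampling-based inference concrete, here is a minimal sketch of the kind of per-point assignment step such methods build on: a collapsed, CRP-style Gibbs update with isotropic Gaussian clusters. It is not the authors' algorithm; `alpha`, `sigma`, and the sufficient-statistics layout are illustrative assumptions.

```python
import numpy as np

def assign_stream_point(x, clusters, alpha=1.0, sigma=1.0):
    """Sample a cluster assignment for one arriving point under a
    CRP prior; each cluster holds sufficient statistics (count, sum)."""
    d = x.shape[0]
    logp = []
    for c in clusters:
        mu = c['sum'] / c['n']                      # current mean estimate
        logp.append(np.log(c['n'])                  # CRP weight ~ cluster size
                    - 0.5 * np.sum((x - mu) ** 2) / sigma**2)
    # weight for opening a brand-new cluster, centered at the prior mean 0
    # (predictive-variance corrections are ignored for brevity)
    logp.append(np.log(alpha) - 0.5 * np.sum(x ** 2) / sigma**2)
    logp = np.array(logp)
    p = np.exp(logp - logp.max()); p /= p.sum()
    k = np.random.choice(len(p), p=p)
    if k == len(clusters):                          # open a new cluster
        clusters.append({'n': 0, 'sum': np.zeros(d)})
    clusters[k]['n'] += 1
    clusters[k]['sum'] += x
    return k
```

Streaming variants keep only such per-cluster sufficient statistics, so each arrival costs O(#clusters * d) regardless of how much data has already passed.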
Related papers
- IncA-DES: An incremental and adaptive dynamic ensemble selection approach using online K-d tree neighborhood search for data streams with concept drift [6.6364343000413815]
IncA-DES employs a training strategy that promotes the generation of local experts. An online K-d tree algorithm can quickly remove instances without becoming inconsistent. The proposed framework achieved the best average accuracy compared to seven state-of-the-art methods.
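A minimal sketch of the neighborhood-search ingredient, assuming scipy and a sliding window; scipy's k-d tree is static, so it is rebuilt on insertion here, whereas IncA-DES's own online tree removes instances incrementally:

```python
import numpy as np
from scipy.spatial import cKDTree

class SlidingWindowNeighbors:
    """Nearest-neighbor search over a bounded window of the stream."""

    def __init__(self, window=1000):
        self.window = window
        self.X, self.y = [], []
        self.tree = None

    def add(self, x, label):
        self.X.append(np.asarray(x)); self.y.append(label)
        if len(self.X) > self.window:        # forget the oldest instance
            self.X.pop(0); self.y.pop(0)
        self.tree = cKDTree(np.vstack(self.X))

    def neighborhood(self, x, k=7):
        _, idx = self.tree.query(np.asarray(x), k=min(k, len(self.X)))
        return [self.y[i] for i in np.atleast_1d(idx)]
```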
arXiv Detail & Related papers (2025-07-16T18:42:12Z) - OASIS: Online Sample Selection for Continual Visual Instruction Tuning [55.92362550389058]
In continual instruction tuning (CIT) scenarios, new instruction tuning data continuously arrive in an online streaming manner. Data selection can mitigate this overhead, but existing strategies often rely on pretrained reference models. Recent reference-model-free online sample selection methods address this, but typically select a fixed number of samples per batch.
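One plausible reference-model-free criterion with an adaptive per-batch count (a toy stand-in, not the OASIS rule): keep the samples whose loss exceeds a running average, so informative batches contribute more samples than redundant ones.

```python
import numpy as np

class AdaptiveSelector:
    """Select an adaptive number of samples per batch by comparing
    per-sample losses against an exponential running mean."""

    def __init__(self, momentum=0.99):
        self.mean, self.momentum = None, momentum

    def select(self, losses):
        losses = np.asarray(losses)
        m = losses.mean()
        self.mean = m if self.mean is None else \
            self.momentum * self.mean + (1 - self.momentum) * m
        return np.nonzero(losses > self.mean)[0]   # indices worth training on
```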
arXiv Detail & Related papers (2025-05-27T20:32:43Z) - Sequential Order-Robust Mamba for Time Series Forecasting [5.265578815577529]
Mamba has emerged as a promising alternative to Transformers, offering near-linear complexity in processing sequential data.
We propose SOR-Mamba, a TS forecasting method that incorporates a regularization strategy to minimize the discrepancy between two embedding vectors generated from data with reversed channel orders.
We also introduce channel correlation modeling (CCM), a pretraining task aimed at preserving correlations between channels from the data space to the latent space in order to enhance the ability to capture CD.
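The order-robustness regularizer can be sketched as follows (assuming PyTorch; `encoder` stands in for the Mamba backbone, and the loss weighting is up to the user):

```python
import torch

def order_robust_reg(encoder, x):
    """Penalize the gap between embeddings of a series and of the same
    series with its channel order reversed; x is [batch, channels, length]."""
    z = encoder(x)
    z_rev = encoder(torch.flip(x, dims=[1]))   # reverse the channel axis
    return torch.mean((z - z_rev) ** 2)

# training objective: forecast_loss + lam * order_robust_reg(encoder, x)
```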
arXiv Detail & Related papers (2024-10-30T18:05:22Z) - Semi-Supervised Model-Free Bayesian State Estimation from Compressed Measurements [57.04370580292727]
We consider data-driven Bayesian state estimation from compressed measurements.
The dimension of the temporal measurement vector is lower than that of the temporal state vector to be estimated.
The underlying dynamical model of the state's evolution is unknown, making this a 'model-free process'.
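The compressed-measurement setting itself is easy to picture with a linear-Gaussian toy example, where a standard Kalman filter applies; the paper's point is precisely that it does not assume such a known model but learns the estimator from data.

```python
import numpy as np

dx, dy = 4, 2                          # state dim > measurement dim
F = np.eye(dx)                         # assumed random-walk dynamics
H = np.random.randn(dy, dx)            # compressive measurement matrix
Q, R = 0.01 * np.eye(dx), 0.1 * np.eye(dy)

def kalman_step(m, P, y):
    """One predict/update cycle of the classical Kalman filter."""
    m_pred, P_pred = F @ m, F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(dx) - K @ H) @ P_pred
    return m_new, P_new
```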
arXiv Detail & Related papers (2024-07-10T05:03:48Z) - PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
In light of increasing privacy concerns, we propose a Parameter-Efficient Federated Anomaly Detection framework named PeFAD.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
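A generic parameter-efficient aggregation step of the kind such frameworks rely on (a sketch; PeFAD's own aggregation and backbone details follow the paper): each client trains only small adapter tensors, and the server averages just those, weighted by client data size.

```python
def federated_average(client_adapters, client_sizes):
    """Size-weighted average of the adapter tensors uploaded by clients;
    each element of client_adapters maps parameter name -> array."""
    total = sum(client_sizes)
    return {
        name: sum(w * a[name] for w, a in zip(client_sizes, client_adapters)) / total
        for name in client_adapters[0]
    }
```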
arXiv Detail & Related papers (2024-06-04T13:51:08Z) - Online Variational Sequential Monte Carlo [49.97673761305336]
We build upon the variational sequential Monte Carlo (VSMC) method, which provides computationally efficient and accurate model parameter estimation and Bayesian latent-state inference.
Online VSMC performs both parameter estimation and particle proposal adaptation efficiently and entirely on the fly.
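The particle mechanics underneath can be sketched as a single SMC step (bootstrap proposal for brevity; VSMC's point is that the proposal is parameterized and adapted by stochastic gradients, which is omitted here; `trans_sample` and `log_lik` are user-supplied callables):

```python
import numpy as np

def smc_step(particles, log_w, y, trans_sample, log_lik):
    """Resample, propagate, and reweight one step of a particle filter."""
    n = len(particles)
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    idx = np.random.choice(n, size=n, p=w)       # multinomial resampling
    particles = trans_sample(particles[idx])     # propose next states
    log_w = log_lik(y, particles)                # incremental log-weights
    return particles, log_w
```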
arXiv Detail & Related papers (2023-12-19T21:45:38Z) - Distributed Collapsed Gibbs Sampler for Dirichlet Process Mixture Models
in Federated Learning [0.22499166814992444]
This paper proposes a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs (DisCGS) using sufficient statistics.
Our approach uses the collapsed Gibbs sampler and is specifically designed to work on distributed data across independent and heterogeneous machines.
For instance, with a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations while our approach achieves the same number of iterations in just 3 minutes.
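The sufficient-statistics idea is simple to sketch for Gaussian components: each worker summarizes its shard per cluster as (count, sum, sum of outer products), and these merge by addition, so collapsed-Gibbs updates never need the raw remote data (the field layout here is illustrative).

```python
import numpy as np

def local_stats(X, labels, k):
    """Per-cluster sufficient statistics computed on one worker's shard."""
    d = X.shape[1]
    stats = [(0, np.zeros(d), np.zeros((d, d))) for _ in range(k)]
    for x, z in zip(X, labels):
        n, s, ss = stats[z]
        stats[z] = (n + 1, s + x, ss + np.outer(x, x))
    return stats

def merge_stats(a, b):
    """Statistics from two workers combine by simple addition."""
    return [(n1 + n2, s1 + s2, ss1 + ss2)
            for (n1, s1, ss1), (n2, s2, ss2) in zip(a, b)]
```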
arXiv Detail & Related papers (2023-12-18T13:16:18Z) - A parsimonious, computationally efficient machine learning method for
spatial regression [0.0]
We introduce the modified planar rotator method (MPRS), a physically inspired machine learning method for spatial/temporal regression.
MPRS is a non-parametric model which incorporates spatial or temporal correlations via short-range, distance-dependent "interactions" without assuming a specific form for the underlying probability distribution.
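As a rough illustration of the spin analogy (a toy 1-D version; MPRS itself works on spatial grids with equilibrium Monte Carlo): sample values are mapped to angles, and a missing site is filled with the angle minimizing the interaction energy -sum cos(theta_i - theta_j) over its neighbors, i.e. their circular mean.

```python
import numpy as np

def relax_missing(theta, missing, neighbors, sweeps=50):
    """Iteratively set each missing angle to the circular mean of its
    neighbors, the local minimizer of the planar-rotator energy."""
    for _ in range(sweeps):
        for i in missing:
            nb = neighbors(i)
            theta[i] = np.arctan2(np.sin(theta[nb]).mean(),
                                  np.cos(theta[nb]).mean())
    return theta

# e.g. on a ring of length n: neighbors = lambda i: np.array([i - 1, i + 1]) % n
```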
arXiv Detail & Related papers (2023-09-28T13:57:36Z) - Personalized Federated Learning under Mixture of Distributions [98.25444470990107]
We propose a novel approach to Personalized Federated Learning (PFL), which utilizes Gaussian mixture models (GMM) to fit the input data distributions across diverse clients.
FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification.
Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.
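A centralized stand-in for the per-client role the GMM plays (using scikit-learn; FedGMM's federated EM is omitted): fit a mixture to the inputs, then flag novel samples by low likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # stand-in client data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

scores = gmm.score_samples(X)                     # per-sample log-likelihood
threshold = np.quantile(scores, 0.01)             # e.g. flag the lowest 1%
novel = gmm.score_samples(rng.uniform(5, 6, size=(10, 2))) < threshold
```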
arXiv Detail & Related papers (2023-05-01T20:04:46Z) - On Calibrating Diffusion Probabilistic Models [78.75538484265292]
Diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks.
We propose a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can be increased.
Our calibration method is performed only once and the resulting models can be used repeatedly for sampling.
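One way to picture such a post-hoc correction (a sketch of the idea only; the paper derives the exact form): since the true score has zero expectation, estimate the per-timestep mean of the network's output on noised data and subtract it at sampling time.

```python
import torch

@torch.no_grad()
def estimate_score_means(score_net, noised_batches, timesteps):
    """Empirical mean of the score network output per timestep;
    noised_batches(t) is assumed to yield batches noised to level t."""
    means = {}
    for t in timesteps:
        acc, n = 0.0, 0
        for x in noised_batches(t):
            out = score_net(x, t)
            acc = acc + out.sum(dim=0)
            n += out.shape[0]
        means[t] = acc / n
    return means

# at sampling time: calibrated_score = score_net(x, t) - means[t]
```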
arXiv Detail & Related papers (2023-02-21T14:14:40Z) - Model-based recursive partitioning for discrete event times [3.222802562733787]
We propose MOB for discrete survival data (MOB-dS), which controls the type I error rate of the test used for data splitting.
We find that the type I error rate of the test is well controlled for MOB-dS, but observe considerable inflation of the error rate for MOB.
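A generic stand-in for an error-controlled split decision (not the exact MOB-dS test): associate model residuals with a candidate partitioning variable and split only if a permutation test rejects at level alpha.

```python
import numpy as np

def split_is_significant(residuals, z, alpha=0.05, n_perm=999, seed=0):
    """Permutation test for association between residuals and a split
    variable; controls the probability of a spurious split at alpha."""
    obs = abs(np.corrcoef(residuals, z)[0, 1])
    rng = np.random.default_rng(seed)
    null = np.array([abs(np.corrcoef(rng.permutation(residuals), z)[0, 1])
                     for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= obs)) / (n_perm + 1)
    return p_value < alpha
```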
arXiv Detail & Related papers (2022-09-14T12:17:56Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including data from applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
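The consensus-ADMM backbone can be sketched for local least-squares losses (the coding and mini-batching from the paper are omitted; `rho` is the usual penalty parameter):

```python
import numpy as np

def consensus_admm(A, b, rho=1.0, iters=100):
    """Agents i minimize ||A_i x - b_i||^2 subject to agreeing on x."""
    m, d = len(A), A[0].shape[1]
    x, u, z = np.zeros((m, d)), np.zeros((m, d)), np.zeros(d)
    for _ in range(iters):
        for i in range(m):                       # local primal updates
            lhs = 2 * A[i].T @ A[i] + rho * np.eye(d)
            rhs = 2 * A[i].T @ b[i] + rho * (z - u[i])
            x[i] = np.linalg.solve(lhs, rhs)
        z = (x + u).mean(axis=0)                 # consensus step
        u += x - z                               # dual ascent
    return z
```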
arXiv Detail & Related papers (2020-10-02T10:41:59Z) - Semi-Supervised Learning with Normalizing Flows [54.376602201489995]
FlowGMM is an end-to-end approach to generative semi-supervised learning with normalizing flows.
We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data.
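The prediction rule is easy to sketch (assuming a trained invertible `flow` and one latent Gaussian mean per class; training is omitted): map the input to latent space and pick the class whose Gaussian scores it highest.

```python
import numpy as np

def flowgmm_predict(flow, means, x, sigma=1.0):
    """Classify x by maximum class-conditional latent likelihood."""
    z = flow(x)                                  # latent code of x
    logp = [-0.5 * np.sum((z - mu) ** 2) / sigma**2 for mu in means]
    return int(np.argmax(logp))

# e.g.: flowgmm_predict(lambda x: x, [np.zeros(2), 3 * np.ones(2)], np.array([2.9, 3.1]))
```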
arXiv Detail & Related papers (2019-12-30T17:36:33Z)