Coresets for Time Series Clustering
- URL: http://arxiv.org/abs/2110.15263v1
- Date: Thu, 28 Oct 2021 16:21:13 GMT
- Title: Coresets for Time Series Clustering
- Authors: Lingxiao Huang, K. Sudhir, Nisheeth K. Vishnoi
- Abstract summary: We study the problem of constructing coresets for clustering problems with time series data.
Our main contribution is an algorithm to construct coresets for a mixture model.
We empirically assess the performance of our coreset with synthetic data.
- Score: 33.801060211529354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of constructing coresets for clustering problems with
time series data. This problem has gained importance across many fields
including biology, medicine, and economics due to the proliferation of sensors
facilitating real-time measurement and rapid drop in storage costs. In
particular, we consider the setting where the time series data on $N$ entities
is generated from a Gaussian mixture model with autocorrelations over $k$
clusters in $\mathbb{R}^d$. Our main contribution is an algorithm to construct
coresets for the maximum likelihood objective for this mixture model. Our
algorithm is efficient, and under a mild boundedness assumption on the
covariance matrices of the underlying Gaussians, the size of the coreset is
independent of the number of entities $N$ and the number of observations for
each entity, and depends only polynomially on $k$, $d$ and $1/\varepsilon$,
where $\varepsilon$ is the error parameter. We empirically assess the
performance of our coreset with synthetic data.
Related papers
- Learning general Gaussian mixtures with efficient score matching [16.06356123715737]
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions.
We make no separation assumptions on the underlying mixture components.
We give an algorithm that draws $dmathrmpoly(k/varepsilon)$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler.
arXiv Detail & Related papers (2024-04-29T17:30:36Z) - A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing
Time [6.961946145048321]
We address the problem of designing a sublinear-time spectral clustering oracle for graphs that exhibit strong clusterability.
Our algorithm relaxes assumptions, albeit at the cost of a slightly higher misclassification ratio.
Our clustering oracle is robust against a few random edge deletions.
arXiv Detail & Related papers (2023-10-27T03:40:37Z) - New Coresets for Projective Clustering and Applications [34.82221047030618]
Given a set of points $P$ in $mathbbRd$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure.
We show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_infty$, and Fair regression.
arXiv Detail & Related papers (2022-03-08T19:50:27Z) - Leverage Score Sampling for Tensor Product Matrices in Input Sparsity
Time [54.65688986250061]
We give an input sparsity time sampling algorithm for approximating the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices.
Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time.
arXiv Detail & Related papers (2022-02-09T15:26:03Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $widetildeO(ndk)$ for all $k = O(sqrtd) cup Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $widetildeO(ndk)$ for all $k$, at the expense of an $O(sqrtlog k)$ factor in the recovery guarantee
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = langle X,w* rangle + epsilon$
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $mathbbE [XXtop]$ has bounded condition number and $epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $epsilon$-corrupted $k$-mixture within information theoretically near-optimal error proofs of $tildeO(epsilon)$.
Our main technical contribution is a new robust identifiability proof clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z) - Outlier-Robust Clustering of Non-Spherical Mixtures [5.863264019032882]
We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated d-dimensional Gaussians (k-GMMs)
Our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere.
arXiv Detail & Related papers (2020-05-06T17:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.