Optimal estimation of sparse topic models
- URL: http://arxiv.org/abs/2001.07861v1
- Date: Wed, 22 Jan 2020 03:19:50 GMT
- Title: Optimal estimation of sparse topic models
- Authors: Xin Bing and Florentina Bunea and Marten Wegkamp
- Abstract summary: This paper studies the estimation of a possibly element-wise sparse word-topic matrix $A$ when the number of topics $K$ is unknown.
We derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery.
Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths.
- Score: 3.308743964406688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic models have become popular tools for dimension reduction and
exploratory analysis of text data, which consists of the observed frequencies of a
vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The
main premise is that the mean of this data matrix can be factorized into a
product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a
$K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$
when it is possibly element-wise sparse and the number of topics $K$ is unknown.
In this under-explored context, we derive a new minimax lower bound for the
estimation of such $A$ and propose a new computationally efficient algorithm
for its recovery. We derive a finite sample upper bound for our estimator, and
show that it matches the minimax lower bound in many scenarios. Our estimate
adapts to the unknown sparsity of $A$ and our analysis is valid for any finite
$n$, $p$, $K$ and document lengths. Empirical results on both synthetic data
and semi-synthetic data show that our proposed estimator is a strong competitor
of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse
$A$, and has superior performance in many scenarios of interest.
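As a minimal illustration of the data-generating process described in this abstract (not the paper's estimator), the following Python sketch draws word frequencies whose mean factorizes as $AW$ with an element-wise sparse word-topic matrix $A$; the dimensions, document length, and sparsity level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n, N = 500, 5, 100, 200   # vocabulary size, topics, documents, document length

# Element-wise sparse word-topic matrix A: each column is a distribution over p words.
A = rng.exponential(size=(p, K))
A[rng.random((p, K)) < 0.8] = 0.0        # zero out roughly 80% of the entries
A += 1e-6                                # avoid all-zero columns before normalizing
A /= A.sum(axis=0, keepdims=True)

# Topic-document matrix W: each column is a distribution over K topics.
W = rng.dirichlet(np.ones(K), size=n).T  # shape (K, n)

# Observed word frequencies: each document is a multinomial draw whose mean is a column of A @ W.
Pi = A @ W                               # (p, n) expected word frequencies
X = np.column_stack([rng.multinomial(N, Pi[:, j]) / N for j in range(n)])
print(X.shape)                           # (p, n) observed frequency matrix
```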
Related papers
- Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probability that expert $i$ answers question $j$ correctly is bi-isotonic up to a permutation of its rows and columns.
We construct a computationally efficient algorithm that turns out to be minimax optimal for this classification problem.
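A small sketch of the bi-isotonic assumption referenced above (purely illustrative; `is_bi_isotonic` is a hypothetical helper, not code from the paper): after suitable row and column permutations, the entries of $M$ are non-decreasing along every row and every column.

```python
import numpy as np

def is_bi_isotonic(M: np.ndarray) -> bool:
    """Check that entries are non-decreasing along both rows and columns."""
    return bool(np.all(np.diff(M, axis=0) >= 0) and np.all(np.diff(M, axis=1) >= 0))

# Experts sorted by ability, questions sorted by easiness.
M = np.array([[0.2, 0.4, 0.7],
              [0.3, 0.5, 0.8],
              [0.6, 0.6, 0.9]])
print(is_bi_isotonic(M))  # True
```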
arXiv Detail & Related papers (2024-08-27T18:28:31Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
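For context, here is a generic linear-sketch example (a plain Gaussian sketch with illustrative sizes, not the paper's construction): the Euclidean norm of a long vector can be estimated from a much shorter sketch $Sx$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10_000, 200                             # ambient dimension and sketch size (illustrative)
x = rng.standard_normal(n)

S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian linear sketching matrix
norm_from_sketch = np.linalg.norm(S @ x)       # uses only the m-dimensional sketch
print(norm_from_sketch, np.linalg.norm(x))     # the two values should be close
```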
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Compressive Recovery of Sparse Precision Matrices [5.557600489035657]
We consider the problem of learning a graph modeling the statistical relations of the $d$ variables from a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$.
We show that it is possible to estimate it from a sketch of size $m = \Omega\left((d+2k)\log(d)\right)$ where $k$ is the maximal number of edges of the underlying graph.
We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser.
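The graphical-lasso building block mentioned above can be illustrated with scikit-learn; this sketch fits an $\ell_1$-penalized precision matrix from full (unsketched) samples, so it only shows the denoiser, not the paper's compressive recovery. Dimensions, penalty, and the banded ground truth are illustrative choices.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
d, n = 20, 2_000

# Ground-truth sparse precision matrix: banded, hence few edges in the graph.
Theta = np.eye(d) + 0.3 * (np.eye(d, k=1) + np.eye(d, k=-1))
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Theta), size=n)

model = GraphicalLasso(alpha=0.05).fit(X)      # l1-penalized precision estimate
est_edges = np.count_nonzero(np.triu(model.precision_, k=1) != 0)
print("estimated off-diagonal edges:", est_edges)
```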
arXiv Detail & Related papers (2023-11-08T13:29:08Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
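A hedged sketch of the general one-dimensional-projection idea (not the paper's algorithm or its guarantees): project the points onto a random unit direction and cluster the resulting scalars.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k, d = 3, 10
centers = rng.normal(scale=5.0, size=(k, d))
X = np.vstack([c + rng.standard_normal((200, d)) for c in centers])

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                      # random unit direction
proj = X @ u                                # one-dimensional projections

labels = KMeans(n_clusters=k, n_init=10).fit_predict(proj.reshape(-1, 1))
print(np.bincount(labels))                  # cluster sizes found on the projected data
```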
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Optimal Query Complexities for Dynamic Trace Estimation [59.032228008383484]
We consider the problem of minimizing the number of matrix-vector queries needed for accurate trace estimation in the dynamic setting where our underlying matrix is changing slowly.
We provide a novel binary tree summation procedure that simultaneously estimates all $m$ traces up to $\epsilon$ error with $\delta$ failure probability.
Our lower bounds (1) give the first tight bounds for Hutchinson's estimator in the matrix-vector product model with Frobenius norm error even in the static setting, and (2) are the first unconditional lower bounds for dynamic trace estimation.
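For reference, the classical Hutchinson estimator that these bounds concern can be written in a few lines; the paper's binary-tree summation procedure is not reproduced here, and the test matrix below is an arbitrary illustrative choice.

```python
import numpy as np

def hutchinson_trace(matvec, d, num_queries, rng):
    """Estimate tr(A) as the average of z^T A z over random sign vectors z."""
    total = 0.0
    for _ in range(num_queries):
        z = rng.choice([-1.0, 1.0], size=d)   # Rademacher query vector
        total += z @ matvec(z)                # one matrix-vector product per query
    return total / num_queries

rng = np.random.default_rng(0)
d = 500
A = rng.standard_normal((d, d))
A = A @ A.T / d                               # symmetric PSD test matrix

print(hutchinson_trace(lambda v: A @ v, d, num_queries=200, rng=rng))
print(np.trace(A))                            # estimate vs. exact trace
```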
arXiv Detail & Related papers (2022-09-30T04:15:44Z) - List-Decodable Sparse Mean Estimation [7.594050968868919]
A surge of recent research interest has focused on the list-decodable setting where $\alpha \in (0, \frac{1}{2}]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean.
In this paper, we consider the setting where the underlying target mean is $k$-sparse; our first contribution is an estimator with sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e. poly-logarithmic in the dimension $d$.
arXiv Detail & Related papers (2022-05-28T05:13:45Z) - Likelihood estimation of sparse topic distributions in topic models and
its applications to Wasserstein document distance calculations [3.679981089267181]
In topic models, the $p\times n$ expected word frequency matrix is assumed to be factorized as the product of a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $T$.
The columns of $A$ are viewed as $p$-dimensional mixture components that are common to all documents.
When $A$ is unknown, we estimate $T$ by optimizing the likelihood function corresponding to a plug-in, generic estimator $\hat{A}$ of $A$.
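A minimal sketch of this plug-in likelihood idea: given a generic estimate $\hat{A}$, the topic proportions of a single document can be fitted by maximizing the multinomial likelihood, here with simple EM-style updates (an illustrative optimization choice, not necessarily the paper's method); the sizes and the random $\hat{A}$ below are stand-ins.

```python
import numpy as np

def mle_topic_proportions(x, A_hat, n_iter=200):
    """Fit topic proportions t for word counts x by multinomial maximum likelihood (EM updates)."""
    p, K = A_hat.shape
    t = np.full(K, 1.0 / K)                   # start from the uniform distribution
    for _ in range(n_iter):
        mix = A_hat @ t                       # current word probabilities, length p
        resp = A_hat * t / np.clip(mix, 1e-12, None)[:, None]  # posterior over topics per word
        t = resp.T @ x
        t /= t.sum()                          # renormalize onto the simplex
    return t

rng = np.random.default_rng(0)
p, K, N = 300, 4, 1_000
A_hat = rng.dirichlet(np.ones(p), size=K).T   # p x K, columns on the simplex
t_true = rng.dirichlet(np.ones(K))
x = rng.multinomial(N, A_hat @ t_true)        # observed word counts for one document

print(np.round(mle_topic_proportions(x, A_hat), 3), np.round(t_true, 3))
```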
arXiv Detail & Related papers (2021-07-12T22:22:32Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K \subset \mathbb{R}^d$, given access to $A \in \mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)