Near-Optimal Learning of Tree-Structured Distributions by Chow-Liu
- URL: http://arxiv.org/abs/2011.04144v2
- Date: Thu, 22 Jul 2021 06:37:12 GMT
- Title: Near-Optimal Learning of Tree-Structured Distributions by Chow-Liu
- Authors: Arnab Bhattacharyya, Sutanu Gayen, Eric Price, N. V. Vinodchandran
- Abstract summary: We provide finite sample guarantees for the classical Chow-Liu algorithm (IEEE Trans. Inform. Theory, 1968).
We show that for a specific tree $T$, with $\widetilde{O}(|\Sigma|^2 n\varepsilon^{-1})$ samples from a distribution $P$ over $\Sigma^n$, one can efficiently learn the closest $T$-structured distribution in KL divergence.
- Score: 14.298220510927695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide finite sample guarantees for the classical Chow-Liu algorithm
(IEEE Trans.~Inform.~Theory, 1968) to learn a tree-structured graphical model
of a distribution. For a distribution $P$ on $\Sigma^n$ and a tree $T$ on $n$
nodes, we say $T$ is an $\varepsilon$-approximate tree for $P$ if there is a
$T$-structured distribution $Q$ such that $D(P\;||\;Q)$ is at most
$\varepsilon$ more than the best possible tree-structured distribution for $P$.
We show that if $P$ itself is tree-structured, then the Chow-Liu algorithm with
the plug-in estimator for mutual information with $\widetilde{O}(|\Sigma|^3
n\varepsilon^{-1})$ i.i.d.~samples outputs an $\varepsilon$-approximate tree
for $P$ with constant probability. In contrast, for a general $P$ (which may
not be tree-structured), $\Omega(n^2\varepsilon^{-2})$ samples are necessary to
find an $\varepsilon$-approximate tree. Our upper bound is based on a new
conditional independence tester that addresses an open problem posed by
Canonne, Diakonikolas, Kane, and Stewart~(STOC, 2018): we prove that for three
random variables $X,Y,Z$ each over $\Sigma$, testing if $I(X; Y \mid Z)$ is $0$
or $\geq \varepsilon$ is possible with $\widetilde{O}(|\Sigma|^3/\varepsilon)$
samples. Finally, we show that for a specific tree $T$, with $\widetilde{O}
(|\Sigma|^2n\varepsilon^{-1})$ samples from a distribution $P$ over $\Sigma^n$,
one can efficiently learn the closest $T$-structured distribution in KL
divergence by applying the add-1 estimator at each node.
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
Date: 2024-11-01
- Almost Minimax Optimal Best Arm Identification in Piecewise Stationary Linear Bandits [55.957560311008926]
We propose a piecewise stationary linear bandit (PSLB) model where the quality of an arm is measured by its return averaged over all contexts.
PS$\varepsilon$BAI$+$ is guaranteed to identify an $\varepsilon$-optimal arm with probability $\ge 1-\delta$ and with a minimal number of samples.
Date: 2024-10-10
- A spectral least-squares-type method for heavy-tailed corrupted regression with unknown covariance & heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression, assuming a corrupted label-feature sample of size $n$ with at most $\epsilon n$ arbitrary outliers.
We propose a near-optimal, computationally tractable estimator based on the power method, assuming no knowledge of $(\Sigma,\Xi)$ nor of the operator norm of $\Xi$.
Date: 2022-09-06
- Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
Date: 2022-02-10
- Tree density estimation [12.831051269764115]
We study density estimation for a random vector $\boldsymbol{X}$ in $\mathbb{R}^d$ with probability density $f(\boldsymbol{x})$.
For Lipschitz continuous $f$ with bounded support, $\mathbb{E}\int |f_n(\boldsymbol{x})-f_{T^*}(\boldsymbol{x})|\,d\boldsymbol{x}=0$ a.s.
Date: 2021-11-23
- Coresets for Decision Trees of Signals [19.537354146654845]
We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for every such matrix $D$.
This is done by forging a link between decision trees from machine learning and partition trees from computational geometry.
Date: 2021-10-07
- Learning low-degree functions from a logarithmic number of random queries [77.34726150561087]
We prove that for any integer $n\in\mathbb{N}$, $d\in\{1,\ldots,n\}$, and any $\varepsilon,\delta\in(0,1)$, a bounded function $f:\{-1,1\}^n\to[-1,1]$ of degree at most $d$ can be learned.
Date: 2021-09-21
- Self-training Converts Weak Learners to Strong Learners in Mixture Models [86.7137362125503]
We show that a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$.
We additionally show that by running gradient descent on the logistic loss, one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples.
Date: 2021-06-25
- Robust Gaussian Covariance Estimation in Nearly-Matrix Multiplication Time [14.990725929840892]
We show an algorithm that runs in time $\widetilde{O}(T(N, d) \log\kappa / \mathrm{poly}(\varepsilon))$, where $T(N, d)$ is the time it takes to multiply a $d \times N$ matrix by its transpose.
Our runtime matches that of the fastest algorithm for covariance estimation without outliers, up to poly-logarithmic factors.
Date: 2020-06-23
- Efficient Statistics for Sparse Graphical Models from Truncated Samples [19.205541380535397]
We focus on two fundamental and classical problems: (i) inference of sparse Gaussian graphical models and (ii) support recovery of sparse linear models.
For sparse linear regression, suppose samples $(\mathbf{x},y)$ are generated where $y = \mathbf{x}^\top\Omega^* + \mathcal{N}(0,1)$ and $(\mathbf{x}, y)$ is seen only if $y$ belongs to a truncation set $S \subseteq \mathbb{R}^d$.
Date: 2020-06-17
- Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, a discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
Date: 2020-06-06
This list is automatically generated from the titles and abstracts of the papers on this site.