How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?
- URL: http://arxiv.org/abs/2407.15662v1
- Date: Mon, 22 Jul 2024 14:19:19 GMT
- Title: How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?
- Authors: Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
- Abstract summary: We exploit permutation-equivalence, a structural property arising naturally in machine learning problems, to refine confidence sets for a set of unknown discrete distributions.
We show that the size of the refined confidence sets shrinks at asymptotic rates of $O(1/\sqrt{\sum_{k\in\mathcal K} n_k})$ and $O(1/\max_{k\in\mathcal K} n_k)$, respectively for elements inside and outside the support of $q$.
- Score: 17.52590726972374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the situation when a learner faces a set of unknown discrete distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet $\mathcal X$, and can build for each distribution $p_k$ an individual high-probability confidence set thanks to $n_k$ observations sampled from $p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$ is obtained from the same common, but unknown, distribution $q$ via applying an unknown permutation to $\mathcal X$. We call this \emph{permutation-equivalence}. The goal is to build refined confidence sets \emph{exploiting} this structural property. Like other popular notions of structure (Lipschitz smoothness, linearity, etc.), permutation-equivalence naturally appears in machine learning problems, and benefiting from its potential gain calls for a specific approach. We present a strategy to effectively exploit permutation-equivalence, and provide a finite-time high-probability bound on the size of the refined confidence sets output by the strategy. Since a refinement is not possible for too few observations in general, under mild technical assumptions, our finite-time analysis establishes when the numbers of observations $(n_k)_{k\in \mathcal K}$ are large enough for the output confidence sets to improve over the initial individual sets. We carefully characterize this event and the corresponding improvement. Further, our result implies that the size of the confidence sets shrinks at asymptotic rates of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in \mathcal K} n_{k})$, respectively for elements inside and outside the support of $q$, when the size of each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$ and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation-equivalence on a reinforcement learning task.
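For intuition, here is a minimal sketch (our own illustration, not the paper's actual strategy) of why permutation-equivalence helps: sorted empirical frequencies are permutation-invariant, so they can be pooled across all $\sum_{k\in\mathcal K} n_k$ observations. The alphabet size, sample sizes, and pooling rule below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Common unknown distribution q over an alphabet of size A (sizes are illustrative).
A, K = 6, 5
q = rng.dirichlet(np.ones(A))

# Each p_k is q with its support relabelled by an unknown permutation.
perms = [rng.permutation(A) for _ in range(K)]
n = rng.integers(200, 1000, size=K)  # per-distribution sample counts n_k
counts = [np.bincount(rng.choice(A, size=n[k], p=q[perms[k]]), minlength=A)
          for k in range(K)]

# Individual empirical estimates of each p_k.
p_hat = [c / c.sum() for c in counts]

# Permutation-equivalence means the *sorted* probability vectors coincide,
# so sorted empirical frequencies can be pooled across all sum_k n_k samples.
sorted_profile = np.average([np.sort(p) for p in p_hat], axis=0, weights=n)

print("sorted q          :", np.round(np.sort(q), 3))
print("pooled estimate   :", np.round(sorted_profile, 3))
print("single-k estimate :", np.round(np.sort(p_hat[0]), 3))
```

The pooled sorted profile concentrates at roughly $O(1/\sqrt{\sum_k n_k})$, matching the flavor of the rate above, whereas each individual estimate only enjoys $O(1/\sqrt{n_k})$; recovering a specific $p_k$ additionally requires matching its estimated permutation, which the paper's strategy handles and this sketch does not.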
Related papers
- Rényi divergence-based uniformity guarantees for $k$-universal hash functions [59.90381090395222]
Universal hash functions map the output of a source to random strings over a finite alphabet.
We show that it is possible to distill random bits that are nearly uniform, as measured by min-entropy.
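For background on this entry, a minimal sketch of a classical 2-universal (Carter–Wegman) hash family; this generic construction is our own illustration and is not taken from the cited paper.

```python
import random

# Carter–Wegman 2-universal family: h_{a,b}(x) = ((a*x + b) mod p) mod m,
# with p a prime larger than the input universe and (a, b) the random seed.
P = 2_147_483_647  # Mersenne prime 2^31 - 1

def sample_hash(m: int):
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = sample_hash(m=16)
print([h(x) for x in [3, 42, 10**6]])  # maps source outputs to a small alphabet
```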
arXiv Detail & Related papers (2024-10-21T19:37:35Z)
- Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probabilities that expert $i$ answers question $j$ correctly is bi-isotonic up to a permutation of its rows and columns.
We construct a computationally efficient algorithm that turns out to be minimax optimal for this classification problem.
arXiv Detail & Related papers (2024-08-27T18:28:31Z)
- Transfer Operators from Batches of Unpaired Points via Entropic Transport Kernels [3.099885205621181]
We derive a maximum-likelihood inference functional, propose a computationally tractable approximation and analyze their properties.
We prove a $\Gamma$-convergence result showing that we can recover the true density from empirical approximations as the number $N$ of blocks goes to infinity.
arXiv Detail & Related papers (2024-02-13T12:52:41Z)
- Identification of Mixtures of Discrete Product Distributions in Near-Optimal Sample and Time Complexity [6.812247730094931]
We show, for any $n\geq 2k-1$, how to achieve sample complexity and run-time complexity $(1/\zeta)^{O(k)}$.
We also extend the known lower bound of $e^{\Omega(k)}$ to match our upper bound across a broad range of $\zeta$.
arXiv Detail & Related papers (2023-09-25T09:50:15Z)
- Learning and Covering Sums of Independent Random Variables with Unbounded Support [4.458210211781738]
We study the problem of covering and learning sums $X = X_1 + \cdots + X_n$ of independent integer-valued random variables $X_i$ with unbounded, or even infinite, support.
We show that the maximum value of the collective support of the $X_i$'s necessarily appears in the sample complexity of learning $X$.
arXiv Detail & Related papers (2022-10-24T15:03:55Z)
- Near-optimal fitting of ellipsoids to random points [68.12685213894112]
A basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis.
We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some $n = \Omega(\, d^2/\mathrm{polylog}(d) \,)$.
Our proof demonstrates feasibility of the least squares construction of Saunderson et al. using a convenient decomposition of a certain non-standard random matrix.
arXiv Detail & Related papers (2022-08-19T18:00:34Z)
- Metric-Fair Classifier Derandomization [6.269732593554894]
We study the problem of classifier derandomization in machine learning.
We show that the prior derandomization approach is almost maximally metric-unfair.
We devise a derandomization procedure that provides an appealing tradeoff between these two.
arXiv Detail & Related papers (2022-06-15T21:36:57Z)
- Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication, and storage needs a factor of $b$ fewer bits compared to classical random features.
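A minimal sketch of the ternary random features idea as we read it; the function name, sparsity parameter, and cosine nonlinearity below are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_random_features(X, D, sparsity=0.5):
    """Random features with ternary weights in {-1, 0, +1} (illustrative sketch).

    The projection X @ W needs no real multiplications: each output is a
    signed sum of selected inputs, and W stores roughly 2 bits per entry.
    """
    d = X.shape[1]
    W = rng.choice([-1, 0, 1], size=(d, D),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])
    return np.cos(X @ W / np.sqrt(D))  # cosine nonlinearity, as in classical random features

X = rng.standard_normal((4, 10))
Z = ternary_random_features(X, D=64)
print(Z.shape)  # (4, 64)
```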
arXiv Detail & Related papers (2021-10-05T09:33:49Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
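To make the setting concrete, a generic sketch of gradient descent on the empirical square loss of a single ReLU neuron; the synthetic data, initialization, and step size are our own assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

relu = lambda z: np.maximum(z, 0.0)

# Synthetic agnostic data: labels are a noisy ReLU of a hidden direction.
d, n = 5, 2000
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = relu(X @ w_star) + 0.1 * rng.standard_normal(n)

# Gradient descent on the empirical square loss L(w) = mean((relu(Xw) - y)^2) / 2.
w, lr = 0.1 * rng.standard_normal(d), 0.05   # nonzero init so the ReLU is active
for _ in range(500):
    z = X @ w
    grad = X.T @ ((relu(z) - y) * (z > 0)) / n   # subgradient through the ReLU
    w -= lr * grad

print("empirical square loss:", np.mean((relu(X @ w) - y) ** 2))
```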
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
- Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal Q$ whose total variation distance to $p$ is comparable to that of the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)