Towards the Fundamental Limits of Knowledge Transfer over Finite Domains
- URL: http://arxiv.org/abs/2310.07838v4
- Date: Tue, 14 Nov 2023 11:26:10 GMT
- Title: Towards the Fundamental Limits of Knowledge Transfer over Finite Domains
- Authors: Qingyue Zhao and Banghua Zhu
- Abstract summary: We show that privileged information at three progressive levels accelerates the transfer.
At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{|\mathcal S||\mathcal A|/n}$.
The third level further equips the student with the soft labels (complete logits) on $\mathcal A$ given every sampled input, thereby provably enabling the student to enjoy a rate $|\mathcal S|/n$ free of $|\mathcal A|$.
- Score: 8.575522204707957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We characterize the statistical efficiency of knowledge transfer through $n$
samples from a teacher to a probabilistic student classifier with input space
$\mathcal S$ over labels $\mathcal A$. We show that privileged information at
three progressive levels accelerates the transfer. At the first level, only
samples with hard labels are known, via which the maximum likelihood estimator
attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The
second level has the teacher probabilities of sampled labels available in
addition, which turns out to boost the convergence rate lower bound to
${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data
acquisition protocol, minimizing a naive adaptation of the cross-entropy loss
results in an asymptotically biased student. We overcome this limitation and
achieve the fundamental limit by using a novel empirical variant of the squared
error logit loss. The third level further equips the student with the soft
labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby
provably enabling the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of
$|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be
optimal in the last case. Numerical simulations distinguish the four learners
and corroborate our theory.
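Since the abstract closes by noting that numerical simulations distinguish the four learners, a small tabular experiment can make the three data-acquisition levels concrete. The sketch below is a minimal illustration under assumed protocols: the level-2 and level-3 students are simple plug-in estimators standing in for the paper's squared-error-logit-loss and KL-minimizing learners (whose exact empirical losses are not spelled out here), and the naive cross-entropy student is omitted.

```python
# Minimal sketch of the tabular knowledge-transfer setting described above.
# The teacher is a conditional distribution p(a | s) over a finite label set A
# for each input s in a finite domain S.  The level-2 and level-3 estimators
# are illustrative plug-ins, NOT the losses analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)
S, A, n = 5, 4, 10_000                        # |S|, |A|, sample size

teacher = rng.dirichlet(np.ones(A), size=S)   # ground-truth p(a | s), shape (S, A)
states = rng.integers(0, S, size=n)           # sampled inputs
labels = np.array([rng.choice(A, p=teacher[s]) for s in states])  # hard labels

# Level 1: hard labels only -> maximum likelihood estimate (empirical frequencies).
counts = np.zeros((S, A))
np.add.at(counts, (states, labels), 1.0)
mle = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

# Level 2: the teacher's probability of each *sampled* label is also revealed.
# Simple plug-in: average the revealed probabilities per (s, a); this only
# illustrates the extra information the protocol provides.
revealed = teacher[states, labels]
level2_sum = np.zeros((S, A))
seen = np.zeros((S, A))
np.add.at(level2_sum, (states, labels), revealed)
np.add.at(seen, (states, labels), 1.0)
level2 = np.where(seen > 0, level2_sum / np.maximum(seen, 1.0), mle)

# Level 3: the full soft label (whole teacher row) is revealed for every sampled
# input, so any KL-divergence minimizer recovers those rows exactly.
level3 = mle.copy()
level3[np.unique(states)] = teacher[np.unique(states)]

def avg_l1(est):
    """Average L1 distance to the teacher across inputs."""
    return np.abs(est - teacher).sum() / S

print("level 1 (hard labels, MLE)    :", avg_l1(mle))
print("level 2 (sampled-label probs) :", avg_l1(level2))
print("level 3 (full soft labels)    :", avg_l1(level3))
```

Because the teacher is known exactly in this toy setup, the level-2 and level-3 plug-ins look unrealistically good; they are meant only to show what extra information each protocol reveals, not to reproduce the rates stated above.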
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z) - Transfer Learning Beyond Bounded Density Ratios [21.522183597134234]
We study the fundamental problem of transfer learning where a learning algorithm collects data from some source distribution $P$ but needs to perform well with respect to a different target distribution $Q$.
Our main result is a general transfer inequality over the domain $\mathbb{R}^n$, proving that non-trivial transfer learning for low-degree polynomials is possible under very mild assumptions.
arXiv Detail & Related papers (2024-03-18T17:02:41Z) - Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss [33.18537822803389]
We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$, $\mathscr{F}$ is a weakly sub-Gaussian class.
Our result holds whether the problem is realizable or not, and we refer to this as a *near mixing-free* rate, since direct dependence on mixing is relegated to an additive higher-order term.
arXiv Detail & Related papers (2024-02-08T18:57:42Z) - Towards a statistical theory of data selection under weak supervision [7.540077751816086]
Given a sample of size $N$, it is often useful to select a subsample of smaller size $n<N$ to be used for statistical estimation or learning.
We assume to be given $N$ unlabeled samples $\{\boldsymbol{x}_i\}_{i\le N}$, and to be given access to a 'surrogate model' that can predict labels $y_i$ better than random guessing.
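As a hedged illustration of this data-selection setup only (the selection rules, the linear model, and all names below are hypothetical, not the schemes analyzed in the cited paper), one can compare a uniform subsample of size $n$ against a subsample chosen where a noisy surrogate is least confident:

```python
# Hypothetical illustration of surrogate-based data selection: the surrogate's
# confidence decides WHICH n of the N points get labeled for downstream fitting.
import numpy as np

rng = np.random.default_rng(1)
N, n, d = 5_000, 500, 10

X = rng.normal(size=(N, d))                      # N unlabeled inputs
w_true = rng.normal(size=d)                      # unknown labeling rule
y = (X @ w_true > 0).astype(float)               # labels, revealed only if selected

w_surr = w_true + rng.normal(scale=1.0, size=d)  # noisy surrogate, better than chance
margin = np.abs(X @ w_surr)                      # large margin = surrogate is confident

uniform_idx = rng.choice(N, size=n, replace=False)
uncertain_idx = np.argsort(margin)[:n]           # keep the n least-confident points

def fit_and_score(idx):
    # least-squares fit on the selected subsample, evaluated on all N points
    w_hat, *_ = np.linalg.lstsq(X[idx], 2 * y[idx] - 1, rcond=None)
    return ((X @ w_hat > 0) == y.astype(bool)).mean()

print("uniform subsample accuracy   :", fit_and_score(uniform_idx))
print("uncertainty-based accuracy   :", fit_and_score(uncertain_idx))
```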
arXiv Detail & Related papers (2023-09-25T22:23:27Z) - Active Cost-aware Labeling of Streaming Data [11.501619634838312]
We study actively labeling streaming data, where an active learner is faced with a stream of data points and must carefully choose which of these points to label via an expensive experiment.
We first study a setting when the data's inputs belong to one of $K$ discrete distributions and formalize this problem via a loss that captures the labeling cost and the prediction error.
When the labeling cost is $B$, our algorithm, which chooses to label a point if the uncertainty is larger than a time- and cost-dependent threshold, achieves a worst-case upper bound of $\widetilde{O}\bigl(B^{1/3}\cdots\bigr)$.
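A minimal sketch of such an uncertainty-threshold rule (the uncertainty proxy and the threshold schedule below are placeholders, not the algorithm achieving the bound in the cited paper):

```python
# Illustrative streaming loop: pay cost B to label a point only when the current
# uncertainty for its (discrete) input type exceeds a time-dependent threshold.
import numpy as np

rng = np.random.default_rng(2)
K, T, B = 4, 2_000, 10.0                  # input types, rounds, labeling cost

true_means = rng.normal(size=K)           # unknown per-type label means
counts = np.zeros(K)
sums = np.zeros(K)
total_cost = 0.0

for t in range(1, T + 1):
    k = rng.integers(K)                   # next streaming input's type
    uncertainty = 1.0 / np.sqrt(counts[k] + 1.0)   # crude proxy: few labels = wide
    threshold = (B / t) ** (1.0 / 3.0)             # placeholder time/cost schedule
    if uncertainty > threshold:
        y = true_means[k] + rng.normal()  # run the expensive labeling experiment
        counts[k] += 1
        sums[k] += y
        total_cost += B

estimates = sums / np.maximum(counts, 1.0)
print("labels bought:", int(counts.sum()), " total cost:", total_cost)
print("max estimation error:", np.abs(estimates - true_means).max())
```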
arXiv Detail & Related papers (2023-04-13T20:23:27Z) - Statistical Hypothesis Testing Based on Machine Learning: Large Deviations Analysis [15.605887551756933]
We study the performance -- and specifically the rate at which the error probability converges to zero -- of Machine Learning (ML) classification techniques.
We provide the mathematical conditions for an ML classifier to exhibit error probabilities that vanish exponentially, say $\sim \exp\left(-n\,I + o(n)\right)$.
In other words, the convergence of the classification error probability to zero, and its rate, can be computed on a portion of the dataset available for training.
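Spelling out the stated decay (a standard reading of the sentence above, not text quoted from the cited paper): an exponentially vanishing error probability with exponent $I$ means

```latex
% exponential decay of the classification error probability with rate I
P_e^{(n)} \sim \exp\!\left(-n\,I + o(n)\right)
\quad\Longleftrightarrow\quad
I = -\lim_{n\to\infty} \frac{1}{n} \log P_e^{(n)},
```

so $I$ is precisely the rate that, per the summary, can be estimated from the portion of the dataset available for training.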
arXiv Detail & Related papers (2022-07-22T08:30:10Z) - High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize [55.0090961425708]
We propose a new, simplified high-probability analysis of AdaGrad for smooth, non-convex problems.
We present our analysis in a modular way and obtain a complementary $\mathcal{O}(1/T)$ convergence rate in the deterministic setting.
To the best of our knowledge, this is the first high-probability bound for AdaGrad with a truly adaptive scheme, i.e., completely oblivious to the knowledge of smoothness.
arXiv Detail & Related papers (2022-04-06T13:50:33Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound of $O(|\mathcal{S}|H^{3/2}/N)$ on the suboptimality using the Mimic-MD algorithm of Rajaraman et al. (2020).
We show that the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $\tilde{O}(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity.
arXiv Detail & Related papers (2020-10-01T16:48:33Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
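For context on the algorithm whose sample complexity is being bounded, here is a generic tabular asynchronous Q-learning loop along a single Markovian trajectory (a textbook-style sketch with an arbitrary random MDP and a common step-size choice; the variance-reduced variant and the specific constants of the cited analysis are not reproduced):

```python
# Generic asynchronous (single-trajectory) tabular Q-learning sketch.
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, T = 6, 3, 0.9, 200_000

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a, s']
R = rng.uniform(size=(nS, nA))                  # deterministic rewards in [0, 1]

Q = np.zeros((nS, nA))
visits = np.zeros((nS, nA))
s = 0
for t in range(T):
    a = rng.integers(nA)                        # behavior policy: uniform exploration
    s_next = rng.choice(nS, p=P[s, a])
    visits[s, a] += 1
    eta = 1.0 / (1.0 + visits[s, a]) ** 0.85    # polynomially decaying step size
    # asynchronous update: only the visited (s, a) entry changes this round
    Q[s, a] += eta * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy policy:", Q.argmax(axis=1))
```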
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)