A No-Free-Lunch Theorem for MultiTask Learning
- URL: http://arxiv.org/abs/2006.15785v4
- Date: Wed, 5 Aug 2020 18:05:50 GMT
- Title: A No-Free-Lunch Theorem for MultiTask Learning
- Authors: Steve Hanneke and Samory Kpotufe
- Abstract summary: We consider a seemingly favorable classification scenario where all tasks $P_t$ share a common optimal classifier $h^*$.
We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists.
- Score: 19.645741778058227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multitask learning and related areas such as multi-source domain adaptation
address modern settings where datasets from $N$ related distributions $\{P_t\}$
are to be combined towards improving performance on any single such
distribution ${\cal D}$. A perplexing fact remains in the evolving theory on
the subject: while we would hope for performance bounds that account for the
contribution from multiple tasks, the vast majority of analyses result in
bounds that improve at best in the number $n$ of samples per task, but most
often do not improve in $N$. As such, it might seem at first that the
distributional settings or aggregation procedures considered in such analyses
might be somehow unfavorable; however, as we show, the picture happens to be
more nuanced, with interestingly hard regimes that might appear otherwise
favorable.
In particular, we consider a seemingly favorable classification scenario
where all tasks $P_t$ share a common optimal classifier $h^*,$ and which can be
shown to admit a broad range of regimes with improved oracle rates in terms of
$N$ and $n$. Some of our main results are as follows:
$\bullet$ We show that, even though such regimes admit minimax rates
accounting for both $n$ and $N$, no adaptive algorithm exists; that is, without
access to distributional information, no algorithm can guarantee rates that
improve with large $N$ for $n$ fixed.
$\bullet$ With a bit of additional information, namely, a ranking of tasks
$\{P_t\}$ according to their distance to a target ${\cal D}$, a simple
rank-based procedure can achieve near optimal aggregations of tasks' datasets,
despite a search space exponential in $N$. Interestingly, the optimal
aggregation might exclude certain tasks, even though they all share the same
$h^*$.
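To make the rank-based result concrete, here is a minimal sketch, not the authors' procedure: assuming the task datasets arrive already ranked by their distance to the target ${\cal D}$, it searches only over the $N$ nested prefixes of that ranking (rather than all $2^N$ subsets) and keeps the prefix with the smallest estimated target risk. The names `fit` and `target_risk` are hypothetical stand-ins for a base learner and a held-out risk estimate on ${\cal D}$.

```python
import numpy as np

def rank_based_aggregation(ranked_datasets, fit, target_risk):
    """Select the best prefix of tasks ranked by distance to the target D.

    ranked_datasets : list of (X_t, y_t) arrays, nearest task to D first.
    fit             : hypothetical learner, fit(X, y) -> classifier h.
    target_risk     : hypothetical held-out estimate of risk on D, target_risk(h) -> float.
    """
    best_h, best_risk = None, np.inf
    X_pool, y_pool = [], []
    for X_t, y_t in ranked_datasets:             # grow the pool in rank order:
        X_pool.append(X_t)                       # only N prefixes, not 2^N subsets
        y_pool.append(y_t)
        h = fit(np.vstack(X_pool), np.concatenate(y_pool))
        risk = target_risk(h)
        if risk < best_risk:                     # keep the best prefix seen so far;
            best_h, best_risk = h, risk          # farther tasks may end up excluded
    return best_h, best_risk
```

Because the selected prefix can stop well before all $N$ tasks are pooled, the aggregation may exclude some tasks even though every task shares the same $h^*$, mirroring the remark above.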
Related papers
- Metalearning with Very Few Samples Per Task [19.78398372660794]
We consider a binary classification setting where tasks are related by a shared representation.
Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task.
Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning.
arXiv Detail & Related papers (2023-12-21T16:06:44Z) - Stochastic Approximation Approaches to Group Distributionally Robust
Optimization [96.26317627118912]
Group distributionally robust optimization (GDRO) is studied.
Online learning techniques reduce the number of samples required in each round from $m$ to $1$ while keeping the same sample complexity.
A novel formulation of weighted GDRO allows distribution-dependent convergence rates to be derived.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - On the Sample Complexity of Representation Learning in Multi-task
Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor).
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(\delta_G) + X\log(\delta_H))$, with $X, G, H$ being, respectively, the number of tasks, representations, and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z) - TURF: A Two-factor, Universal, Robust, Fast Distribution Learning
Algorithm [64.13217062232874]
One of the most powerful and successful modalities of distribution approximation approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than its closest $t$-piece degree-$d$ polynomial.
We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation.
arXiv Detail & Related papers (2022-02-15T03:49:28Z) - Minimax Optimal Algorithms with Fixed-$k$-Nearest Neighbors [13.231906521852718]
We consider a distributed learning scenario in which a massive dataset is split into smaller groups.
We propose optimal rules to aggregate the fixed-$k$-NN information for classification, regression, and density estimation (a toy aggregation sketch appears after this list).
We show that the distributed algorithm with a fixed $k$ over a sufficiently large number of groups attains a minimax optimal error rate up to a multiplicative logarithmic factor.
arXiv Detail & Related papers (2022-02-05T01:59:09Z) - Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z) - On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Processes (MDPs).
We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy whose suboptimality is $\tilde{O}\big(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2\,\kappa d}{NT} + \frac{\kappa d}{n}}\big)$.
arXiv Detail & Related papers (2021-06-15T11:21:06Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
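As a toy illustration of the distributed fixed-$k$-NN setting described above (Minimax Optimal Algorithms with Fixed-$k$-Nearest Neighbors), the sketch below splits the data into groups, runs a fixed-$k$ nearest-neighbor classifier within each group, and combines the per-group predictions by simple majority vote. This is one natural aggregation rule for binary classification, not necessarily the paper's optimal rule.

```python
import numpy as np

def knn_predict(X, y, x, k):
    """Plain k-NN majority vote within a single group (binary labels 0/1)."""
    dist = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(round(y[nearest].mean()))

def distributed_knn_predict(X, y, x, k, n_groups, seed=0):
    """Split (X, y) into groups, run fixed-k-NN in each, majority-vote the answers."""
    idx = np.random.default_rng(seed).permutation(len(X))
    votes = [
        knn_predict(X[part], y[part], x, k)
        for part in np.array_split(idx, n_groups)
    ]
    return int(round(np.mean(votes)))
```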
This list is automatically generated from the titles and abstracts of the papers in this site.