Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck
- URL: http://arxiv.org/abs/2309.03800v2
- Date: Mon, 30 Oct 2023 15:32:25 GMT
- Title: Pareto Frontiers in Neural Feature Learning: Data, Compute, Width, and Luck
- Authors: Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril
Zhang
- Abstract summary: We consider offline sparse parity learning, a supervised classification problem which admits a statistical query lower bound for gradient-based training of a multilayer perceptron.
We show, theoretically and experimentally, that sparse initialization and increasing network width yield significant improvements in sample efficiency in this setting.
We also show that the synthetic sparse parity task can be useful as a proxy for real problems requiring axis-aligned feature learning.
- Score: 35.6883212537938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In modern deep learning, algorithmic choices (such as width, depth, and
learning rate) are known to modulate nuanced resource tradeoffs. This work
investigates how these complexities necessarily arise for feature learning in
the presence of computational-statistical gaps. We begin by considering offline
sparse parity learning, a supervised classification problem which admits a
statistical query lower bound for gradient-based training of a multilayer
perceptron. This lower bound can be interpreted as a multi-resource tradeoff
frontier: successful learning can only occur if one is sufficiently rich (large
model), knowledgeable (large dataset), patient (many training iterations), or
lucky (many random guesses). We show, theoretically and experimentally, that
sparse initialization and increasing network width yield significant
improvements in sample efficiency in this setting. Here, width plays the role
of parallel search: it amplifies the probability of finding "lottery ticket"
neurons, which learn sparse features more sample-efficiently. Finally, we show
that the synthetic sparse parity task can be useful as a proxy for real
problems requiring axis-aligned feature learning. We demonstrate improved
sample efficiency on tabular classification benchmarks by using wide,
sparsely-initialized MLP models; these networks sometimes outperform tuned
random forests.
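As an illustration of the setup above, here is a minimal numpy sketch of the offline sparse parity task and a sparsely-initialized hidden layer, where width multiplies the chance that some neuron's support covers the relevant bits. The dimensions, fan-in, and helper names are illustrative choices, not the paper's exact settings or code.

```python
import numpy as np

def sparse_parity_data(n_samples, n_bits=50, k=3, seed=0):
    """Offline k-sparse parity: the label is the parity of k hidden bits.

    n_bits and k are illustrative, not the paper's exact settings.
    """
    rng = np.random.default_rng(seed)
    support = rng.choice(n_bits, size=k, replace=False)  # hidden relevant coordinates
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X[:, support].sum(axis=1) % 2  # parity of the k hidden bits
    return X.astype(np.float32), y, support

def sparse_init_hidden_layer(width, n_bits, fan_in=3, seed=0):
    """Each hidden neuron connects to only `fan_in` random inputs.

    A wide layer then acts as a parallel search: more neurons means more
    independent draws, so a higher chance that some neuron's support
    already covers the k relevant bits ("lottery ticket" neurons).
    """
    rng = np.random.default_rng(seed)
    W = np.zeros((width, n_bits), dtype=np.float32)
    for j in range(width):
        idx = rng.choice(n_bits, size=fan_in, replace=False)
        W[j, idx] = rng.standard_normal(fan_in)
    return W

X, y, support = sparse_parity_data(1024)
W = sparse_init_hidden_layer(width=4096, n_bits=50)

# Count neurons whose (sparse) support covers all k relevant bits.
# Per neuron the hit probability is C(n-k, fan_in-k) / C(n, fan_in);
# width amplifies the overall hit rate linearly.
hits = sum(set(support) <= set(np.flatnonzero(W[j])) for j in range(W.shape[0]))
```

With these toy numbers the per-neuron hit probability is tiny (1/C(50,3) for fan_in = k = 3), which is exactly why the abstract frames width as buying "luck": the expected number of lottery-ticket neurons scales with width.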
Related papers
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Collaborative Learning with Different Labeling Functions [7.228285747845779]
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, sample-efficient learning is still feasible.
arXiv Detail & Related papers (2024-02-16T04:32:22Z) - More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory [12.689249854199982]
We show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples.
We then demonstrate that, for a large class of tasks characterized by power-law eigenstructure, training to near-zero training loss is obligatory.
arXiv Detail & Related papers (2023-11-24T18:27:41Z) - Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs [21.528321119061694]
We show a separation result in the number of training steps with standard (bounded) learning rates on a common sample distribution.
We also provide experimental results supporting the qualitative separation beyond the specific regime of the theoretical results.
arXiv Detail & Related papers (2023-06-29T13:14:42Z) - Sampling weights of deep neural networks [1.2370077627846041]
We introduce a probability distribution, combined with an efficient sampling algorithm, for weights and biases of fully-connected neural networks.
In a supervised learning context, no iterative optimization or gradient computations of internal network parameters are needed.
We prove that sampled networks are universal approximators.
arXiv Detail & Related papers (2023-06-29T10:13:36Z) - Learning sparse features can lead to overfitting in neural networks [9.2104922520782]
We show that feature learning can perform worse than lazy training.
Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth.
arXiv Detail & Related papers (2022-06-24T14:26:33Z) - BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [93.38239238988719]
We propose to enable deep neural networks with the ability to learn the sample relationships from each mini-batch.
BatchFormer is applied along the batch dimension of each mini-batch to implicitly explore sample relationships during training.
We perform extensive experiments on over ten datasets and the proposed method achieves significant improvements on different data scarcity applications.
arXiv Detail & Related papers (2022-03-03T05:31:33Z) - Smoothed Online Learning is as Easy as Statistical Learning [77.00766067963195]
We provide the first oracle-efficient, no-regret algorithms in this setting.
We show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits.
arXiv Detail & Related papers (2022-02-09T19:22:34Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs)
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.