Related papers: Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

URL: http://arxiv.org/abs/2406.02157v1
Date: Tue, 4 Jun 2024 09:44:49 GMT
Title: Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs
Authors: Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan,
Abstract summary: We study the impact of the batch size on the iteration time $T$ of training two-layer neural networks with one-pass gradient descent (SGD) We show that performing gradient updates with large batches minimizes the training time without changing the total sample complexity. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs)
Score: 24.305423716384272
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

Related papers

LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution [62.830878652285406]
Training data attribution identifies which training examples most influenced a model's prediction.<n>LoRIF exploits low-rank structures of gradient to address both bottlenecks.<n>On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$times$ storage reduction and query-time speedup.
arXiv Detail & Related papers (2026-01-29T16:18:34Z)
Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression.<n>We show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z)
Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum [78.27945336558987]
Decentralized server (DFL) eliminates reliance on client-client architecture. Non-smooth regularization is often incorporated into machine learning tasks. We propose a novel novel DNCFL algorithm to solve these problems.
arXiv Detail & Related papers (2025-04-17T08:32:25Z)
Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [0.6906005491572401]
Using an increasing batch size leads to faster RSGD than using a constant batch size. Experiments on principal component analysis and low-rank matrix problems confirmed that, using a growth batch size or an exponential growth batch size results in better performance than using a constant batch size.
arXiv Detail & Related papers (2025-01-30T06:23:28Z)
Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis [54.57279006229212]
Information exponent has played an important role in predicting the sample complexity of online gradient descent.<n>In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant space using the second-order terms.<n>The overall sample and complexity of online SGD is $tildeO( d PL-1 )$.
arXiv Detail & Related papers (2024-10-13T00:14:08Z)
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$ We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators. Key to our solution is a novel projection technique based on ideas from harmonic analysis. Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches. In specific, each batch collects data according to a policy that depends on previous batches and the rewards are revealed only at the end of the batch. Our algorithm achieves regret bounds comparable to those in fully sequential setting with only $mathcalO( log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z)
How Two-Layer Neural Networks Learn, One (Giant) Step at a Time [24.773974771715956]
We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function. We compare the influence of batch size and that of multiple (but finitely many) steps. We show that a batch-size of $n = mathcalO(d)$ is indeed enough to learn multiple target directions satisfying a staircase property.
arXiv Detail & Related papers (2023-05-29T17:43:44Z)
Training trajectories, mini-batch losses and the curious role of the learning rate [13.848916053916618]
We show that validated gradient descent plays a fundamental role in nearly all applications of deep learning. We propose a simple model and a geometric interpretation that allows to analyze the relationship between the gradients of mini-batches and the full batch. In particular, a very low loss value can be reached just one step of descent with large enough learning rate.
arXiv Detail & Related papers (2023-01-05T21:58:46Z)
Learning Compact Features via In-Training Representation Alignment [19.273120635948363]
In each epoch, the true gradient of the loss function is estimated using a mini-batch sampled from the training set. We propose In-Training Representation Alignment (ITRA) that explicitly aligns feature distributions of two different mini-batches with a matching loss. We also provide a rigorous analysis of the desirable effects of the matching loss on feature representation learning.
arXiv Detail & Related papers (2022-11-23T22:23:22Z)
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED) PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $beta$. We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z)
Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss. We examine how these benign overfitting phenomena occur in a two-layer neural network setting. We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z)
The Computational Complexity of ReLU Network Training Parameterized by Data Dimensionality [8.940054309023525]
We analyze the influence of the dimension $d$ of the training data on the computational complexity. We prove that known brute-force strategies are essentially optimal. In particular, we extend a known-time algorithm for constant $d$ and convex loss functions to a more general class of loss functions.
arXiv Detail & Related papers (2021-05-18T17:05:26Z)
Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data. Our results show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
Large-time asymptotics in deep learning [0.0]
We consider the impact of the final time $T$ (which may indicate the depth of a corresponding ResNet) in training. For the classical $L2$--regularized empirical risk minimization problem, we show that the training error is at most of the order $mathcalOleft(frac1Tright)$. In the setting of $ellp$--distance losses, we prove that both the training error and the optimal parameters are at most of the order $mathcalOleft(e-mu
arXiv Detail & Related papers (2020-08-06T07:33:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.