Understanding Incremental Learning of Gradient Descent: A Fine-grained
Analysis of Matrix Sensing
- URL: http://arxiv.org/abs/2301.11500v1
- Date: Fri, 27 Jan 2023 02:30:51 GMT
- Title: Understanding Incremental Learning of Gradient Descent: A Fine-grained
Analysis of Matrix Sensing
- Authors: Jikai Jin and Zhiyuan Li and Kaifeng Lyu and Simon S. Du and Jason D.
Lee
- Abstract summary: It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in machine learning models.
This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem.
- Score: 74.2952487120137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is believed that Gradient Descent (GD) induces an implicit bias towards
good generalization in training machine learning models. This paper provides a
fine-grained analysis of the dynamics of GD for the matrix sensing problem,
whose goal is to recover a low-rank ground-truth matrix from near-isotropic
linear measurements. It is shown that GD with small initialization behaves
similarly to the greedy low-rank learning heuristics (Li et al., 2020) and
follows an incremental learning procedure (Gissin et al., 2019): GD
sequentially learns solutions with increasing ranks until it recovers the
ground truth matrix. Compared to existing works which only analyze the first
learning phase for rank-1 solutions, our result provides characterizations for
the whole learning process. Moreover, besides the over-parameterized regime
that many prior works focused on, our analysis of the incremental learning
procedure also applies to the under-parameterized regime. Finally, we conduct
numerical experiments to confirm our theoretical findings.
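The dynamics described in the abstract can be sketched numerically. The following is a minimal illustration (not the authors' code), assuming a symmetric Gaussian measurement ensemble, a rank-2 PSD ground truth, and the over-parameterized factorization M = UU^T trained by plain GD from a small random initialization; all constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 10, 2, 200                 # dimension, true rank, number of measurements
alpha, lr, steps = 1e-3, 0.02, 5000  # small init scale, step size (illustrative)

# Rank-r ground truth M* = BB^T / d (scaled so its eigenvalues are O(1))
B = rng.standard_normal((d, r))
M_star = (B @ B.T) / d

# Near-isotropic linear measurements y_i = <A_i, M*> with symmetric Gaussian A_i
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2
y = np.einsum('ijk,jk->i', A, M_star)

# Over-parameterized factorization M = UU^T with small random initialization
U = alpha * rng.standard_normal((d, d))
svals = []  # top singular values of U over time
for t in range(steps):
    residual = np.einsum('ijk,jk->i', A, U @ U.T) - y     # <A_i, UU^T> - y_i
    grad = (4 / m) * np.einsum('i,ijk->jk', residual, A) @ U
    U -= lr * grad
    if t % 500 == 0:
        # The singular values of U grow one after another: GD fits a rank-1
        # solution first, then a rank-2 one (incremental learning)
        svals.append(np.linalg.svd(U, compute_uv=False)[:3])

rel_err = np.linalg.norm(U @ U.T - M_star) / np.linalg.norm(M_star)
print('relative recovery error:', rel_err)
```

Printing `svals` shows the rank-by-rank escape from the small initialization; with noiseless measurements the final relative error is limited only by the initialization scale.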
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization problem, where the inner loss function becomes a smooth probability distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm [56.06235614890066]
Gradient descent (GD) and stochastic gradient descent (SGD) have been widely used in a number of application domains.
This paper carefully analyzes the dynamics of GD based on the terminal attractor at different stages of its gradient flow.
arXiv Detail & Related papers (2024-09-10T14:15:56Z) - Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - Understanding Forgetting in Continual Learning with Linear Regression [21.8755265936716]
Continual learning, focused on sequentially learning multiple tasks, has gained significant attention recently.
We provide a general theoretical analysis of forgetting in the linear regression model via Gradient Descent.
We demonstrate that, given a sufficiently large data size, the arrangement of tasks in a sequence, where tasks with larger eigenvalues in their population data covariance matrices are trained later, tends to result in increased forgetting.
arXiv Detail & Related papers (2024-05-27T18:33:37Z) - Model-Agnostic Zeroth-Order Policy Optimization for Meta-Learning of Ergodic Linear Quadratic Regulators [13.343937277604892]
We study the problem of using meta-learning to deal with uncertainty and heterogeneity in ergodic linear quadratic regulators.
We propose an algorithm that omits the estimation of policy Hessian, which applies to tasks of learning a set of heterogeneous but similar linear dynamic systems.
We provide a convergence result for the exact gradient descent process by analyzing the boundedness and smoothness of the gradient for the meta-objective.
arXiv Detail & Related papers (2024-05-27T17:26:36Z) - Theoretical Characterization of the Generalization Performance of
Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features.
We find new and interesting properties that do not exist in single-task linear regression.
Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z) - Stability and Generalization Analysis of Gradient Methods for Shallow
Neural Networks [59.142826407441106]
We study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability.
We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds.
arXiv Detail & Related papers (2022-09-19T18:48:00Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.