Related papers: Applying statistical learning theory to deep learning

Applying statistical learning theory to deep learning

URL: http://arxiv.org/abs/2311.15404v2
Date: Mon, 25 Mar 2024 22:55:43 GMT
Title: Applying statistical learning theory to deep learning
Authors: Cédric Gerbelot, Avetik Karagulyan, Stefani Karp, Kavya Ravichandran, Menachem Stern, Nathan Srebro,
Abstract summary: The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning. We discuss implicit bias in the context of benign overfitting. We provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks.
Score: 21.24637996678039
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning from a learning theory perspective. After a brief reminder on statistical learning theory and stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a general description of the mirror descent algorithm, showing how we may go back and forth between a parameter space and the corresponding function space for a given learning problem, as well as how the geometry of the learning problem may be represented by a metric tensor. Building on this framework, we provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks, showing how the loss function, scale of parameters at initialization and depth of the network may lead to various forms of implicit bias, in particular transitioning between kernel or feature learning.

Related papers

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Deep Learning Through A Telescoping Lens: A Simple Model Provides Empirical Insights On Grokking, Gradient Boosting & Beyond [61.18736646013446]
In pursuit of a deeper understanding of its surprising behaviors, we investigate the utility of a simple yet accurate model of a trained neural network. Across three case studies, we illustrate how it can be applied to derive new empirical insights on a diverse range of prominent phenomena.
arXiv Detail & Related papers (2024-10-31T22:54:34Z)
Cross-Entropy Is All You Need To Invert the Data Generating Process [29.94396019742267]
Empirical phenomena suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning have shown that these methods can recover latent structures by inverting the data generating process. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation.
arXiv Detail & Related papers (2024-10-29T09:03:57Z)
A Unified Framework for Neural Computation and Learning Over Time [56.44910327178975]
Hamiltonian Learning is a novel unified framework for learning with neural networks "over time" It is based on differential equations that: (i) can be integrated without the need of external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; (iii) open to novel perspectives.
arXiv Detail & Related papers (2024-09-18T14:57:13Z)
Dynamics of Supervised and Reinforcement Learning in the Non-Linear Perceptron [3.069335774032178]
We use a dataset-process approach to derive flow equations describing learning. We characterize the effects of the learning rule (supervised or reinforcement learning, SL/RL) and input-data distribution on the perceptron's learning curve. This approach points a way toward analyzing learning dynamics for more-complex circuit architectures.
arXiv Detail & Related papers (2024-09-05T17:58:28Z)
Rethinking Dimensional Rationale in Graph Contrastive Learning from Causal Perspective [15.162584339143239]
Graph contrastive learning is a general learning paradigm excelling at capturing invariant information from diverse perturbations in graphs. Recent works focus on exploring the structural rationale from graphs, thereby increasing the discriminability of the invariant information. We propose the dimensional rationale-aware graph contrastive learning approach, which introduces a learnable dimensional rationale acquiring network and a redundancy reduction constraint.
arXiv Detail & Related papers (2023-12-16T10:05:18Z)
Learned Regularization for Inverse Problems: Insights from a Spectral Model [1.4963011898406866]
This chapter provides a theoretically founded investigation of state-of-the-art learning approaches for inverse problems. We give an extended definition of regularization methods and their convergence in terms of the underlying data distributions.
arXiv Detail & Related papers (2023-12-15T14:50:14Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Envisioning Future Deep Learning Theories: Some Basic Concepts and Characteristics [30.365274034429508]
We argue that a future deep learning theory should inherit three characteristics: a textitarchhierically structured network architecture, parameters textititeratively optimized using gradient-based methods, and information from the data that evolves textitcompressively We integrate these characteristics into a graphical model called textitneurashed, which effectively explains some common empirical patterns in deep learning.
arXiv Detail & Related papers (2021-12-17T19:51:26Z)
A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation. Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z)
Nonparametric Estimation of Heterogeneous Treatment Effects: From Theory to Learning Algorithms [91.3755431537592]
We analyze four broad meta-learning strategies which rely on plug-in estimation and pseudo-outcome regression. We highlight how this theoretical reasoning can be used to guide principled algorithm design and translate our analyses into practice.
arXiv Detail & Related papers (2021-01-26T17:11:40Z)
Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks. We use pairs of minimally-different examples with different labels, a.k.a counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task. Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.