Diagonal Over-parameterization in Reproducing Kernel Hilbert Spaces as an Adaptive Feature Model: Generalization and Adaptivity
- URL: http://arxiv.org/abs/2501.08679v1
- Date: Wed, 15 Jan 2025 09:20:02 GMT
- Title: Diagonal Over-parameterization in Reproducing Kernel Hilbert Spaces as an Adaptive Feature Model: Generalization and Adaptivity
- Authors: Yicheng Li, Qian Lin
- Abstract summary: The diagonal adaptive kernel model learns kernel eigenvalues and output coefficients simultaneously during training.
We show that the adaptivity comes from learning the right eigenvalues during training.
- Score: 11.644182973599788
- License:
- Abstract: This paper introduces a diagonal adaptive kernel model that dynamically learns kernel eigenvalues and output coefficients simultaneously during training. Unlike fixed-kernel methods tied to neural tangent kernel theory, the diagonal adaptive kernel model adapts to the structure of the truth function, significantly improving generalization over fixed-kernel methods, especially when the initial kernel is misaligned with the target. Moreover, we show that the adaptivity comes from learning the right eigenvalues during training, demonstrating a feature-learning behavior. By extending to deeper parameterization, we further show how extra depth enhances adaptability and generalization. This study combines insights from feature learning and implicit regularization and provides a new perspective on the adaptivity and generalization potential of neural networks beyond the kernel regime.
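To make the setup concrete, here is a minimal NumPy sketch of a diagonal adaptive kernel model. It assumes the form f(x) = sum_i lam_i * beta_i * phi_i(x) and runs plain gradient descent on both the diagonal eigenvalue vector lam and the output coefficients beta; the cosine basis, toy target, and hyperparameters are placeholders for this illustration, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder basis: cosines on [0, 1] standing in for the kernel eigenfunctions
# (an assumption; the paper works with the eigen-decomposition of a general RKHS kernel).
def features(x, num_features):
    k = np.arange(num_features)
    return np.cos(np.pi * np.outer(x, k))                    # shape (n, num_features)

n, p, lr, steps = 200, 50, 0.05, 2000
x = rng.uniform(0.0, 1.0, size=n)
y = np.cos(7 * np.pi * x) + 0.5 * np.cos(3 * np.pi * x) + 0.1 * rng.standard_normal(n)

Phi = features(x, p)

# Assumed model form: f(x) = sum_i lam_i * beta_i * phi_i(x).  Both the diagonal
# eigenvalue vector lam and the coefficients beta are trained, so the induced
# kernel sum_i lam_i^2 phi_i(x) phi_i(x') changes during training.
lam = np.full(p, 0.1)        # flat, deliberately misaligned initial eigenvalues
beta = np.zeros(p)

for _ in range(steps):
    resid = Phi @ (lam * beta) - y
    g = Phi.T @ resid / n                                    # gradient w.r.t. the product lam * beta
    beta, lam = beta - lr * g * lam, lam - lr * g * beta     # chain rule for each factor

print("train MSE:", np.mean((Phi @ (lam * beta) - y) ** 2))
print("indices with the largest learned |lam|:", np.argsort(np.abs(lam))[-2:])
```

In this toy run the two largest learned |lam_i| land on the cosines that actually carry the target (i = 3 and 7), a crude analogue of the "learning the right eigenvalues" behavior described in the abstract.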
Related papers
- From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning [3.7857410821449755]
This work presents a theoretical framework of multi-scale adaptive feature learning bridging different approaches.
A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation.
Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network.
arXiv Detail & Related papers (2025-02-05T14:26:50Z)
- Improving Adaptivity via Over-Parameterization in Sequence Models [11.644182973599788]
We show that even with the same set of eigenfunctions, the order of these functions significantly impacts regression outcomes.
We introduce an over-parameterized gradient descent in the realm of sequence models to capture the effects of different orderings of a fixed set of eigenfunctions (see the sketch after this entry).
arXiv Detail & Related papers (2024-09-02T02:11:52Z)
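As a toy illustration of the adaptivity meant in the entry above (a hypothetical experiment, not the paper's exact parameterization or setting), the sketch below observes noisy sequence coefficients y_j = theta*_j + eps_j, over-parameterizes each estimate as theta_j = u_j^2 - v_j^2, and runs gradient descent from a small initialization; the large coefficients are recovered wherever they sit in the ordering, while an estimator that truncates by the given order misses them.

```python
import numpy as np

rng = np.random.default_rng(1)

p, noise, lr, steps = 200, 0.02, 0.1, 500

# True coefficient sequence: a few large entries placed at HIGH indices, so an
# estimator that trusts the given ordering (keep the first k terms) must fail.
theta_star = np.zeros(p)
theta_star[[150, 170, 190]] = [1.0, -0.8, 0.6]
y = theta_star + noise * rng.standard_normal(p)        # observed noisy coefficients

# Over-parameterize theta_j = u_j**2 - v_j**2 and run gradient descent on
# sum_j (theta_j - y_j)**2 from a small initialization (constant factors in
# the gradient are absorbed into the step size).
u = np.full(p, 1e-3)
v = np.full(p, 1e-3)
for _ in range(steps):                                 # stop early: signal converged, noise not yet fit
    resid = u * u - v * v - y
    u, v = u - lr * resid * u, v + lr * resid * v

theta_gd = u * u - v * v
theta_trunc = np.where(np.arange(p) < 20, y, 0.0)      # order-based truncation at k = 20

print("over-parameterized GD error :", np.sum((theta_gd - theta_star) ** 2))
print("order-based truncation error:", np.sum((theta_trunc - theta_star) ** 2))
```

In this toy run the over-parameterized estimate incurs a far smaller error than the order-based truncation, because the implicit bias of the factorized parameterization picks out the large coefficients regardless of their position.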
- Function-Space Regularization in Neural Networks: A Probabilistic Perspective [51.133793272222874]
We show that we can derive a well-motivated regularization technique that allows explicitly encoding information about desired predictive functions into neural network training.
We evaluate the utility of this regularization technique empirically and demonstrate that the proposed method leads to near-perfect semantic shift detection and highly-calibrated predictive uncertainty estimates.
arXiv Detail & Related papers (2023-12-28T17:50:56Z)
- An Adaptive Tangent Feature Perspective of Neural Networks [4.900298402690262]
We consider linear transformations of features, resulting in a joint optimization over parameters and transformations with a bilinear constraint.
Specializing to neural network structure, we gain insights into how the features and thus the kernel function change.
We verify our theoretical observations in the kernel alignment of real neural networks (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-29T17:57:20Z)
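A loose, hypothetical sketch of the kind of joint optimization the entry above describes: the model f(x) = w . (A phi(x)) is bilinear in the linear feature transformation A and the output weights w, and both are updated by gradient descent. The random cosine features, the omission of the paper's bilinear constraint handling, and all hyperparameters are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed base features phi(x): a random cosine feature map (an assumption; the
# paper works with tangent features of a neural network).
d, m, n = 5, 64, 300
W0 = rng.standard_normal((d, m))
def phi(X):
    return np.cos(X @ W0)                              # shape (n, m)

X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)     # toy target

# Model f(x) = w . (A phi(x)): gradient descent jointly updates the feature
# transformation A and the weights w.
A = np.eye(m) + 0.01 * rng.standard_normal((m, m))
w = np.zeros(m)
lr, steps = 0.05, 3000
Phi = phi(X)

for _ in range(steps):
    Z = Phi @ A.T                                      # transformed features, shape (n, m)
    resid = Z @ w - y
    g_w = Z.T @ resid / n                              # d loss / d w
    g_A = np.outer(w, Phi.T @ resid / n)               # d loss / d A, shape (m, m)
    w -= lr * g_w
    A -= lr * g_A

print("train MSE:", np.mean((Phi @ A.T @ w - y) ** 2))
```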
- Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression [63.56922682378755]
We focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding.
The proposed adaptive aggregation generates kernel offsets to capture valid information within a content-conditioned range, aiding the transform.
Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
arXiv Detail & Related papers (2023-08-17T01:34:51Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
arXiv Detail & Related papers (2023-04-17T14:23:43Z)
- TANGOS: Regularizing Tabular Neural Networks through Gradient Orthogonalization and Specialization [69.80141512683254]
We introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS).
TANGOS is a novel framework for regularization in the tabular setting built on latent unit attributions.
We demonstrate that our approach can lead to improved out-of-sample generalization performance, outperforming other popular regularization methods.
arXiv Detail & Related papers (2023-03-09T18:57:13Z)
- Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks [18.27510863075184]
We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory.
We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points.
arXiv Detail & Related papers (2022-05-19T16:10:10Z)
- Structure Parameter Optimized Kernel Based Online Prediction with a Generalized Optimization Strategy for Nonstationary Time Series [14.110902170321348]
Online prediction algorithms aided by sparsification techniques in a reproducing kernel Hilbert space are studied.
Such algorithms typically consist of selecting kernel structure parameters and updating the kernel weight vector.
A generalized optimization strategy is designed to construct the kernel dictionary sequentially in multiple kernel connection modes.
arXiv Detail & Related papers (2021-08-18T14:46:31Z)
- Kernel and Rich Regimes in Overparametrized Models [69.40899443842443]
We show that gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms.
We also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
arXiv Detail & Related papers (2020-02-20T15:43:02Z)