Toward Large Kernel Models
- URL: http://arxiv.org/abs/2302.02605v3
- Date: Tue, 20 Jun 2023 03:07:15 GMT
- Title: Toward Large Kernel Models
- Authors: Amirhesam Abedsoltan, Mikhail Belkin, Parthe Pandit
- Abstract summary: We introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD.
We show scaling to model and data sizes which have not been possible with existing kernel methods.
- Score: 16.704246627541103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies indicate that kernel machines can often perform similarly to, or better than, deep neural networks (DNNs) on small datasets. The interest in
kernel machines has been additionally bolstered by the discovery of their
equivalence to wide neural networks in certain regimes. However, a key feature
of DNNs is their ability to scale the model size and training data size
independently, whereas in traditional kernel machines model size is tied to
data size. Because of this coupling, scaling kernel machines to large data has
been computationally challenging. In this paper, we provide a way forward for
constructing large-scale general kernel models, which are a generalization of
kernel machines that decouples the model and data, allowing training on large
datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on
projected dual preconditioned SGD and show scaling to model and data sizes
which have not been possible with existing kernel methods.
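To make the model/data decoupling concrete, below is a minimal NumPy sketch of a general kernel model f(x) = sum_j alpha_j K(x, z_j) whose M centers z_j are chosen independently of the N training points and whose weights are fit with plain mini-batch SGD. The Laplacian kernel, the bandwidth, the center count, and the training loop are illustrative assumptions; the paper's EigenPro 3.0 algorithm instead uses projected dual preconditioned SGD, which this sketch does not implement.

```python
import numpy as np

def laplacian_kernel(X, Z, bandwidth):
    """K[i, j] = exp(-||x_i - z_j|| / bandwidth)."""
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-dists / bandwidth)

class GeneralKernelModel:
    """f(x) = sum_j alpha_j * K(x, z_j), with M centers z_j decoupled from the N training points."""

    def __init__(self, centers, bandwidth=2.0):
        self.centers = centers                   # (M, d), chosen independently of the data
        self.alpha = np.zeros(len(centers))      # (M,) trainable weights
        self.bandwidth = bandwidth

    def predict(self, X):
        return laplacian_kernel(X, self.centers, self.bandwidth) @ self.alpha

    def fit_sgd(self, X, y, lr=0.01, epochs=20, batch_size=256, seed=0):
        # Plain mini-batch SGD on the squared loss; EigenPro 3.0 uses
        # projected dual preconditioned SGD instead (not implemented here).
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            perm = rng.permutation(len(X))
            for idx in np.array_split(perm, max(1, len(X) // batch_size)):
                Kb = laplacian_kernel(X[idx], self.centers, self.bandwidth)  # (b, M)
                residual = Kb @ self.alpha - y[idx]
                self.alpha -= lr * Kb.T @ residual / len(idx)
        return self

# Usage: 10,000 training points but only 500 centers, so model size != data size.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = np.sin(X).sum(axis=1)
centers = rng.normal(size=(500, 5))              # could also be a subsample of X
model = GeneralKernelModel(centers).fit_sgd(X, y)
print(model.predict(X[:5]), y[:5])
```

The point of the sketch is only the decoupling: the model has M = 500 trainable weights even though N = 10,000, whereas a traditional kernel machine would place one weight on every training point.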
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- Faithful and Efficient Explanations for Neural Networks via Neural Tangent Kernel Surrogate Models [7.608408123113268]
We analyze approximate empirical neural tangent kernels (eNTK) for data attribution.
We introduce two new random projection variants of approximate eNTK which allow users to tune the time and memory complexity of their calculation.
We conclude that kernel machines using the approximate neural tangent kernel as the kernel function are effective surrogate models (a minimal sketch of the projected-gradient idea appears after this list).
arXiv Detail & Related papers (2023-05-23T23:51:53Z)
- Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning [4.644263115284322]
Graph neural networks (GNNs) have recently emerged as a promising class of models for graph-structured data in semi-supervised learning.
We introduce the inductive bias of GNNs into Gaussian processes (GPs) to improve their predictive performance on graph-structured data.
We show that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.
arXiv Detail & Related papers (2023-02-12T01:07:56Z)
- Neural Attentive Circuits [93.95502541529115]
We introduce a general-purpose yet modular neural architecture called Neural Attentive Circuits (NACs).
NACs learn the parameterization and a sparse connectivity of neural modules without using domain knowledge.
NACs achieve an 8x speedup at inference time while losing less than 3% performance.
arXiv Detail & Related papers (2022-10-14T18:00:07Z)
- On-Device Domain Generalization [93.79736882489982]
Domain generalization is critical to on-device machine learning applications.
We find that knowledge distillation is a strong candidate for solving the problem.
We propose a simple idea called out-of-distribution knowledge distillation (OKD), which aims to teach the student how the teacher handles (synthetic) out-of-distribution data.
arXiv Detail & Related papers (2022-09-15T17:59:31Z)
- Bézier Gaussian Processes for Tall and Wide Data [24.00638575411818]
We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features.
We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression.
arXiv Detail & Related papers (2022-09-01T10:22:14Z)
- Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions [25.635225717360466]
We show that for a large class of kernels, including the neural kernel of fully connected networks, kernel methods can only perform as well as linear models in a certain high-dimensional regime.
Data models more complex than independent features are therefore needed for high-dimensional analysis.
arXiv Detail & Related papers (2022-01-20T09:35:46Z)
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
The Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes a Canonical/Polyadic decomposition on its parameters.
It handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
- Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions achieving comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z)
- Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be addressed with latent variable models.
High dimensionality and non-linearity are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z)
- Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification [47.423272376757204]
The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks.
We propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule.
Experimental results show that models with the OS-block achieve performance similar to models with the searched optimal RF size.
arXiv Detail & Related papers (2020-02-24T03:33:58Z)
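As noted in the entry on neural tangent kernel surrogate models above, the core computational idea, an approximate eNTK built from randomly projected per-example gradients, can be sketched in a few lines. The toy one-hidden-layer ReLU network, the Gaussian projection matrix, and the projection dimension below are illustrative assumptions and not the specific projection variants proposed in that paper.

```python
import numpy as np

def mlp_grad(x, W1, w2):
    """Per-example parameter gradient g(x) = d f(x) / d(params) for f(x) = w2 . relu(W1 @ x)."""
    z = W1 @ x                                   # pre-activations, shape (H,)
    d_w2 = np.maximum(z, 0.0)                    # df/dw2 = relu(z)
    d_W1 = np.outer(w2 * (z > 0), x)             # df/dW1[j, i] = w2_j * 1[z_j > 0] * x_i
    return np.concatenate([d_W1.ravel(), d_w2])  # flat gradient, shape (H*d + H,)

def projected_entk(X1, X2, W1, w2, proj):
    """Approximate eNTK: K(x, x') ~ (P g(x)) . (P g(x')) with P a scaled Gaussian projection."""
    G1 = np.stack([proj @ mlp_grad(x, W1, w2) for x in X1])  # (n1, k)
    G2 = np.stack([proj @ mlp_grad(x, W1, w2) for x in X2])  # (n2, k)
    return G1 @ G2.T

rng = np.random.default_rng(0)
d, H, k = 10, 32, 64                                 # input dim, hidden width, projection dim
W1 = rng.normal(size=(H, d)) / np.sqrt(d)
w2 = rng.normal(size=H) / np.sqrt(H)
proj = rng.normal(size=(k, H * d + H)) / np.sqrt(k)  # Johnson-Lindenstrauss-style projection
X = rng.normal(size=(8, d))
K = projected_entk(X, X, W1, w2, proj)               # (8, 8) surrogate kernel matrix
print(K.shape)
```

Inner products of the k-dimensional projected gradients approximate the full eNTK while storing only k numbers per example instead of one per network parameter, which is what lets the time and memory cost of the surrogate kernel be tuned.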
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.