Demystifying the Neural Tangent Kernel from a Practical Perspective: Can
it be trusted for Neural Architecture Search without training?
- URL: http://arxiv.org/abs/2203.14577v1
- Date: Mon, 28 Mar 2022 08:43:04 GMT
- Title: Demystifying the Neural Tangent Kernel from a Practical Perspective: Can
it be trusted for Neural Architecture Search without training?
- Authors: Jisoo Mok, Byunggook Na, Ji-Hoon Kim, Dongyoon Han, Sungroh Yoon
- Abstract summary: In this work, we revisit several at-initialization metrics that can be derived from the Neural Tangent Kernel (NTK).
We deduce that modern neural architectures exhibit highly non-linear characteristics, making the NTK-based metrics incapable of reliably estimating the performance of an architecture without some amount of training.
We introduce Label-Gradient Alignment (LGA), a novel NTK-based metric whose inherent formulation allows it to capture the large amount of non-linear advantage present in modern neural architectures.
- Score: 37.29036906991086
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In Neural Architecture Search (NAS), reducing the cost of architecture
evaluation remains one of the most crucial challenges. Among a plethora of
efforts to bypass training of each candidate architecture to convergence for
evaluation, the Neural Tangent Kernel (NTK) is emerging as a promising
theoretical framework that can be utilized to estimate the performance of a
neural architecture at initialization. In this work, we revisit several
at-initialization metrics that can be derived from the NTK and reveal their key
shortcomings. Then, through the empirical analysis of the time evolution of
NTK, we deduce that modern neural architectures exhibit highly non-linear
characteristics, making the NTK-based metrics incapable of reliably estimating
the performance of an architecture without some amount of training. To take
such non-linear characteristics into account, we introduce Label-Gradient
Alignment (LGA), a novel NTK-based metric whose inherent formulation allows it
to capture the large amount of non-linear advantage present in modern neural
architectures. With a minimal amount of training, LGA obtains a meaningful level
of rank correlation with the post-training test accuracy of an architecture.
Lastly, we demonstrate that LGA, complemented with few epochs of training,
successfully guides existing search algorithms to achieve competitive search
performances with significantly less search cost. The code is available at:
https://github.com/nutellamok/DemystifyingNTK.
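As a rough illustration of the kind of quantity LGA measures, below is a minimal PyTorch sketch that scores a model by the Frobenius cosine similarity between its empirical NTK Gram matrix and a centered label Gram matrix. This is an assumption-laden reconstruction, not the authors' implementation; for the exact definition, consult the paper and the repository linked above.

```python
import torch
import torch.nn.functional as F

def _flat_grad(model, scalar):
    """Flatten d(scalar)/d(params) into a single vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(scalar, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def lga_score(model, x, y, num_classes):
    """Assumed form of LGA: Frobenius cosine similarity between the
    empirical NTK Gram matrix and a centered label Gram matrix."""
    logits = model(x)                                  # (n, num_classes)
    G = torch.stack([_flat_grad(model, logits[i].sum())
                     for i in range(x.shape[0])])      # (n, num_params)
    ntk = G @ G.T                                      # empirical NTK Gram matrix
    Y = F.one_hot(y, num_classes).float()
    Y = Y - Y.mean(dim=0, keepdim=True)                # center the labels
    K_y = Y @ Y.T                                      # label Gram matrix
    return (ntk * K_y).sum() / (ntk.norm() * K_y.norm())
```

In this reading, a candidate architecture is trained for a few epochs, scored by `lga_score` on a held-out batch, and candidates are ranked by the resulting value.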
Related papers
- Efficient kernel surrogates for neural network-based regression [0.8030359871216615]
We study the performance of the Conjugate Kernel (CK), an efficient approximation to the Neural Tangent Kernel (NTK).
We show that the CK performance is only marginally worse than that of the NTK and, in certain cases, even superior.
In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively.
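A minimal sketch of the idea, assuming the CK of a network is the Gram matrix of its penultimate-layer features at random initialization, plugged into kernel ridge regression; the function names and the ridge value are illustrative, not from the paper.

```python
import torch

def conjugate_kernel(feature_extractor, x1, x2):
    """Gram matrix of penultimate-layer features at initialization
    (assumed form of the CK; see lead-in)."""
    with torch.no_grad():
        f1, f2 = feature_extractor(x1), feature_extractor(x2)
    return f1 @ f2.T / f1.shape[1]

def kernel_ridge_predict(K_train, K_test, y_train, ridge=1e-3):
    """Kernel ridge regression: y_hat = K_test (K_train + ridge*I)^{-1} y_train."""
    n = K_train.shape[0]
    alpha = torch.linalg.solve(K_train + ridge * torch.eye(n), y_train)
    return K_test @ alpha
```

The appeal is that the CK needs only forward passes, whereas the empirical NTK requires one backward pass per sample.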
arXiv Detail & Related papers (2023-10-28T06:41:47Z)
- Training-free Neural Architecture Search for RNNs and Transformers [0.0]
We develop a new training-free metric, named hidden covariance, that predicts the trained performance of an RNN architecture.
We find that the current search space paradigm for transformer architectures is not optimized for training-free neural architecture search.
arXiv Detail & Related papers (2023-06-01T02:06:13Z)
- Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under a norm constraint.
Generalizing the sample-wise analysis to the real batch setting, the resulting Neural Initialization Optimization (NIO) algorithm automatically finds a better initialization at negligible cost.
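A minimal sketch, assuming GradCosine is the average pairwise cosine similarity of per-sample loss gradients at initialization; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def gradcosine(model, x, y, loss_fn):
    """Assumed form of GradCosine: mean pairwise cosine similarity
    of per-sample loss gradients at initialization."""
    params = [p for p in model.parameters() if p.requires_grad]
    gs = []
    for i in range(x.shape[0]):
        loss = loss_fn(model(x[i:i+1]), y[i:i+1])      # fresh graph per sample
        grads = torch.autograd.grad(loss, params)
        gs.append(torch.cat([g.reshape(-1) for g in grads]))
    G = F.normalize(torch.stack(gs), dim=1)            # (n, p), unit-norm rows
    C = G @ G.T                                        # pairwise cosine matrix
    n = C.shape[0]
    return (C.sum() - n) / (n * (n - 1))               # mean over off-diagonal pairs
```

NIO would then, per the summary above, adjust the initialization (e.g., per-layer scales) to maximize this quantity subject to the norm constraint.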
arXiv Detail & Related papers (2022-10-12T06:49:16Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- KNAS: Green Neural Architecture Search [49.36732007176059]
We propose KNAS, a new kernel-based architecture search approach.
Experiments show that KNAS achieves competitive results while being orders of magnitude faster than "train-then-test" paradigms on image classification tasks.
The searched network also outperforms the strong baseline RoBERTa-large on two text classification tasks.
arXiv Detail & Related papers (2021-11-26T02:11:28Z)
- Connection Sensitivity Matters for Training-free DARTS: From Architecture-Level Scoring to Operation-Level Sensitivity Analysis [32.94768616851585]
Recently proposed training-free NAS methods abandon the training phase and design various zero-cost proxies as scores to identify excellent architectures.
In this paper, we raise an interesting problem: can we properly measure the operation importance in DARTS in a training-free way, while avoiding the parameter-intensive bias?
By using ZEROS for NAS in an iterative and data-agnostic manner, our novel trial leads to a framework called training-free differentiable architecture search (FreeDARTS).
arXiv Detail & Related papers (2021-06-22T04:40:34Z)
- Scaling Neural Tangent Kernels via Sketching and Random Features [53.57615759435126]
Recent works report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets.
We design a near input-sparsity time approximation algorithm for NTK, by sketching the expansions of arc-cosine kernels.
We show that a linear regressor trained on our CNTK features matches the accuracy of the exact CNTK on the CIFAR-10 dataset while achieving a 150x speedup.
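The paper's contribution is the sketching algorithm itself; as a generic illustration of the random-features side only, here is the standard construction in which random ReLU features approximate the degree-1 arc-cosine kernel. This is textbook material, not the authors' near input-sparsity algorithm.

```python
import torch

def arc_cosine_relu_features(x, num_features=4096, seed=0):
    """Random ReLU features: E[phi(x) . phi(y)] equals the degree-1
    arc-cosine kernel k1(x, y), a building block of NTK expansions."""
    g = torch.Generator().manual_seed(seed)
    W = torch.randn(x.shape[1], num_features, generator=g)
    return torch.relu(x @ W) * (2.0 / num_features) ** 0.5

# A linear regressor trained on these features approximates kernel
# regression with k1, which is the sense in which approximate "CNTK
# features" can stand in for the exact kernel.
```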
arXiv Detail & Related papers (2021-06-15T04:44:52Z)
- The Nonlinearity Coefficient -- A Practical Guide to Neural Architecture Design [3.04585143845864]
We develop methods that can predict, without any training, whether an architecture will achieve a relatively high test or training error on a task after training.
We then go on to explain the error in terms of the architecture definition itself and develop tools for modifying the architecture.
Our first major contribution is to show that the 'degree of nonlinearity' of a neural architecture is a key causal driver behind its performance.
arXiv Detail & Related papers (2021-05-25T20:47:43Z)
- Multi-objective Neural Architecture Search with Almost No Training [9.93048700248444]
We propose an effective alternative, dubbed Random-Weight Evaluation (RWE), to rapidly estimate the performance of network architectures.
RWE reduces the computational cost of evaluating an architecture from hours to seconds.
When integrated within an evolutionary multi-objective algorithm, RWE obtains a set of efficient architectures with state-of-the-art performance on CIFAR-10 in less than two hours of search on a single GPU.
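A minimal sketch of the presumed mechanism, assuming RWE keeps the randomly initialized backbone frozen and fits only a linear classifier on top, using its final loss as the proxy score; the hyperparameters and the exact scoring rule are illustrative.

```python
import torch
import torch.nn as nn

def rwe_score(backbone, feat_dim, num_classes, batches, lr=0.1):
    """Freeze the random backbone, train only a linear head, and return
    the (negated) final loss as a cheap quality proxy (assumed form of RWE)."""
    for p in backbone.parameters():
        p.requires_grad_(False)                  # backbone keeps random weights
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in batches:                         # a short list of (x, y) batches
        with torch.no_grad():
            feats = backbone(x)                  # forward passes only
        loss = loss_fn(head(feats), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return -loss.item()                          # higher score = better architecture
```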
arXiv Detail & Related papers (2020-11-27T07:39:17Z)
- Multi-fidelity Neural Architecture Search with Knowledge Distillation [69.09782590880367]
We propose a Bayesian multi-fidelity method for neural architecture search, MF-KD.
Knowledge distillation adds a term to the loss function that forces a network to mimic a teacher network.
We show that training for a few epochs with such a modified loss function leads to a better selection of neural architectures than training for a few epochs with a logistic loss.
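To make the "modified loss function" concrete, here is the standard Hinton-style distillation objective in this spirit; the exact weighting and temperature used by MF-KD are not spelled out in the summary, so treat these as placeholders.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, T=4.0):
    """Cross-entropy on the labels plus a term forcing the student to
    mimic the teacher's softened outputs (placeholder alpha and T)."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd
```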
arXiv Detail & Related papers (2020-06-15T12:32:38Z)
- A Semi-Supervised Assessor of Neural Architectures [157.76189339451565]
We employ an auto-encoder to discover meaningful representations of neural architectures.
A graph convolutional neural network is introduced to predict the performance of architectures.
arXiv Detail & Related papers (2020-05-14T09:02:33Z)