Bypass Exponential Time Preprocessing: Fast Neural Network Training via
Weight-Data Correlation Preprocessing
- URL: http://arxiv.org/abs/2211.14227v1
- Date: Fri, 25 Nov 2022 16:40:49 GMT
- Title: Bypass Exponential Time Preprocessing: Fast Neural Network Training via
Weight-Data Correlation Preprocessing
- Authors: Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, Danyang Zhuo
- Abstract summary: State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy.
Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights.
We present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly, dynamically detect which neurons fire at each iteration.
- Score: 16.35997749365563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last decade, deep neural networks have transformed our society, and
they are already widely applied in various machine learning applications.
State-of-the-art deep neural networks are becoming larger in size every year to
deliver increasing model accuracy, and as a result, model training consumes
substantial computing resources and will only consume more in the future. Using
current training methods, in each iteration, to process a data point $x \in
\mathbb{R}^d$ in a layer, we need to spend $\Theta(md)$ time to evaluate all
the $m$ neurons in the layer. This means processing the entire layer takes
$\Theta(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang,
NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires
exponential time to preprocess either the data or the neural network weights,
making it unlikely to have practical usage.
In this work, we present a new preprocessing method that simply stores the
weight-data correlation in a tree data structure in order to quickly,
dynamically detect which neurons fire at each iteration. Our method requires
only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per
iteration. We complement our new algorithm with a lower bound, proving that
assuming a popular conjecture from complexity theory, one could not
substantially speed up our algorithm for dynamic detection of firing neurons.
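To make the abstract's idea concrete, below is a minimal sketch, in Python, of how storing the weight-data correlations $\langle w_i, x \rangle$ in a sorted, tree-like index allows one to report which neurons fire under a shifted ReLU without scanning all $m$ neurons every iteration. This is not the authors' actual data structure: the class name `CorrelationIndex`, the use of Python's `bisect` module in place of a balanced search tree, and the threshold parameter `b` are illustrative assumptions only.

```python
import bisect

import numpy as np


class CorrelationIndex:
    """Illustrative sketch (not the paper's data structure): keep the
    correlations <w_i, x> for one data point x in a sorted index so the
    neurons firing under a shifted ReLU (<w_i, x> > b) can be reported
    without touching all m neurons. A balanced search tree would make
    updates O(log m); a plain sorted list keeps the sketch short."""

    def __init__(self, W, x):
        # Preprocessing: O(md + m log m) per data point, so O(nmd) over
        # n data points when d >= log m.
        self.x = x
        self.corr = W @ x                            # <w_i, x> for all neurons i
        self.pairs = sorted((c, i) for i, c in enumerate(self.corr))
        self.keys = [c for c, _ in self.pairs]       # sorted correlations

    def firing_neurons(self, b):
        # Report all i with <w_i, x> > b in O(log m + k) time,
        # where k is the number of firing neurons.
        pos = bisect.bisect_right(self.keys, b)
        return [i for _, i in self.pairs[pos:]]

    def update_weight(self, i, w_new):
        # When neuron i's weight changes, recompute one correlation in O(d)
        # and move it to its new sorted position (O(log m) with a real tree;
        # the list delete/insert here costs O(m) and is for illustration).
        pos = bisect.bisect_left(self.pairs, (self.corr[i], i))
        del self.pairs[pos], self.keys[pos]
        c_new = float(w_new @ self.x)
        self.corr[i] = c_new
        pos = bisect.bisect_left(self.keys, c_new)
        self.pairs.insert(pos, (c_new, i))
        self.keys.insert(pos, c_new)


# Usage: per iteration only the k firing neurons are evaluated, costing
# roughly O(log m + k d) for this data point instead of Theta(m d).
rng = np.random.default_rng(0)
m, d, b = 1024, 64, 2.0
W, x = rng.standard_normal((m, d)), rng.standard_normal(d)
index = CorrelationIndex(W, x)
fired = index.firing_neurons(b)
if fired:
    index.update_weight(fired[0], rng.standard_normal(d))
```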
Related papers
- Neural-g: A Deep Learning Framework for Mixing Density Estimation [16.464806944964003]
Mixing (or prior) density estimation is an important problem in machine learning and statistics.
We propose neural-$g$, a new neural network-based estimator for $g$-modeling.
arXiv Detail & Related papers (2024-06-10T03:00:28Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Moccasin: Efficient Tensor Rematerialization for Neural Networks [21.348126106786914]
We develop a new constraint programming formulation called \textsc{Moccasin} with only $O(n)$ integer variables.
We present numerical studies that show that our approach is up to an order of magnitude faster than recent work especially for large-scale graphs.
arXiv Detail & Related papers (2023-04-27T18:41:37Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- A Sublinear Adversarial Training Algorithm [13.42699247306472]
We analyze the convergence guarantee of adversarial training procedure on a two-layer neural network with shifted ReLU activation.
We develop an algorithm for adversarial training with time cost $o(m n d)$ per iteration by applying half-space reporting data structure.
arXiv Detail & Related papers (2022-08-10T15:31:40Z)
- Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new view of neural networks as a collection of binary search trees, where each iteration corresponds to modifying a small subset of the nodes.
We believe this view would have further applications in the design and analysis of deep neural networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z)
- Neural Capacitance: A New Perspective of Neural Network Selection via Edge Dynamics [85.31710759801705]
Current practice requires expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z)
- Does Preprocessing Help Training Over-parameterized Neural Networks? [19.64638346701198]
We propose two novel preprocessing ideas to bypass the $\Omega(mnd)$ barrier.
Our results provide theoretical insights for a large number of previously established fast training methods.
arXiv Detail & Related papers (2021-10-09T18:16:23Z)
- A quantum algorithm for training wide and deep classical neural networks [72.2614468437919]
We show that conditions amenable to classical trainability via gradient descent coincide with those necessary for efficiently solving quantum linear systems.
We numerically demonstrate that the MNIST image dataset satisfies such conditions.
We provide empirical evidence for $O(\log n)$ training of a convolutional neural network with pooling.
arXiv Detail & Related papers (2021-07-19T23:41:03Z)