Explicit Foundation Model Optimization with Self-Attentive Feed-Forward
Neural Units
- URL: http://arxiv.org/abs/2311.07510v1
- Date: Mon, 13 Nov 2023 17:55:07 GMT
- Title: Explicit Foundation Model Optimization with Self-Attentive Feed-Forward
Neural Units
- Authors: Jake Ryland Williams and Haoran Zhao
- Abstract summary: Iterative approximation methods using backpropagation enable the optimization of neural networks, but they remain computationally expensive when used at scale.
This paper presents an efficient alternative for optimizing neural networks that reduces the costs of scaling neural networks and provides high-efficiency optimizations for low-resource applications.
- Score: 4.807347156077897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Iterative approximation methods using backpropagation enable the optimization
of neural networks, but they remain computationally expensive, especially when
used at scale. This paper presents an efficient alternative for optimizing
neural networks that reduces the costs of scaling neural networks and provides
high-efficiency optimizations for low-resource applications. We will discuss a
general result about feed-forward neural networks and then extend this solution
to compositional (multi-layer) networks, which are applied to a simplified
transformer block containing feed-forward and self-attention layers. These
models are used to train highly-specified and complex multi-layer neural
architectures that we refer to as self-attentive feed-forward unit (SAFFU)
layers, which we use to develop a transformer that appears to generalize well
over small, cognitively-feasible volumes of data. Testing demonstrates that
explicit solutions outperform models optimized by backpropagation alone.
Moreover, further application of backpropagation after explicit solutions leads
to better optima from smaller scales of data; that is, explicit-solution warm
starts enable effective models to be trained from much less data. We then carry
out
ablation experiments training a roadmap of about 250 transformer models over
1-million tokens to determine ideal settings. We find that multiple different
architectural variants produce highly-performant models, and discover from this
ablation that some of the best are not the most parameterized. This appears to
indicate that well-generalized models can be reached from less data by using
explicit solutions, and that architectural exploration with explicit solutions
pays dividends in guiding the search for efficient variants with fewer
parameters, which could be incorporated into low-resource hardware where AI
might be embodied.
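
As a concrete illustration of the two ideas above (a simplified block combining self-attention with a feed-forward readout, and an explicit solution used as a warm start for backpropagation), here is a minimal NumPy sketch. It is an assumed toy construction: the random query/key maps, the ridge-regularized least-squares solve, and the data shapes are illustrative choices, not the SAFFU definition or the explicit solution derived in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n training contexts, each a window of t token embeddings of width d,
# with one-hot next-token targets over a vocabulary of size v. These stand in for
# the "small, cognitively-feasible volumes of data" described above.
n, t, d, v = 512, 8, 32, 100
E = rng.normal(size=(n, t, d))                  # token embeddings per context window
Y = np.eye(v)[rng.integers(0, v, size=n)]       # one-hot next-token targets

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# A minimal SAFFU-like block (assumed form, not the paper's exact definition):
# self-attention pools each window into one vector, and a feed-forward readout
# maps that vector to next-token logits.
Wq = rng.normal(size=(d, d)) / np.sqrt(d)       # fixed random query/key maps for the sketch
Wk = rng.normal(size=(d, d)) / np.sqrt(d)

def attend(E):
    q = E[:, -1, :] @ Wq                                      # query from the last position
    k = E @ Wk                                                # keys for every position
    a = softmax(np.einsum("nd,ntd->nt", q, k) / np.sqrt(d))   # attention weights
    return np.einsum("nt,ntd->nd", a, E)                      # attention-pooled context features

X = attend(E)                                   # (n, d) features feeding the readout

# Explicit (closed-form) warm start for the readout weights W: ridge-regularized
# least squares against the one-hot targets, used here as a stand-in for the
# paper's explicit solution rather than a reproduction of its derivation.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def cross_entropy(W):
    P = softmax(X @ W, axis=1)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

print("loss after explicit solve:", cross_entropy(W))

# Warm start in action: a few backpropagation (gradient) steps starting from the
# explicit solution instead of a random initialization.
lr = 0.1
for _ in range(100):
    P = softmax(X @ W, axis=1)
    W -= lr * (X.T @ (P - Y) / n)               # gradient of mean cross-entropy w.r.t. W
print("loss after gradient refinement:", cross_entropy(W))
```

In this toy setting the closed-form solve already lands near a good optimum, so the gradient refinement converges in far fewer steps than training from a random initialization, which mirrors the data- and compute-efficiency claims in the abstract.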
Related papers
- Diffusion Models as Network Optimizers: Explorations and Analysis [71.69869025878856] (2024-11-01)
Generative diffusion models (GDMs) have emerged as a promising new approach to network optimization.
In this study, we first explore the intrinsic characteristics of generative models.
We provide a concise theoretical and intuitive demonstration of the advantages of generative models over discriminative network optimization.
- DiffSG: A Generative Solver for Network Optimization with Diffusion Model [75.27274046562806] (2024-08-13)
Diffusion generative models can consider a broader range of solutions and exhibit stronger generalization by learning parameters.
We propose a new framework, which leverages the intrinsic distribution learning of diffusion generative models to learn high-quality solutions.
- Reducing the Need for Backpropagation and Discovering Better Optima With Explicit Optimizations of Neural Networks [4.807347156077897] (2023-11-13)
We propose a computationally efficient alternative for optimizing neural networks.
We derive an explicit solution to a simple feed-forward language model.
We show that explicit solutions perform near-optimally in experiments.
- Soft Merging: A Flexible and Robust Soft Model Merging Approach for Enhanced Neural Network Performance [6.599368083393398] (2023-09-21)
Stochastic gradient descent (SGD) is often limited to converging to local optima when improving model performance.
The proposed soft merging method mitigates local optima that lead to undesirable results.
Experiments underscore the effectiveness of the merged networks.
- Precision Machine Learning [5.15188009671301] (2022-10-24)
We compare various function approximation methods and study how they scale with increasing parameters and data.
We find that neural networks can often outperform classical approximation methods on high-dimensional examples.
We develop training tricks which enable us to train neural networks to extremely low loss, close to the limits allowed by numerical precision.
- Learning to Learn with Generative Models of Neural Network Checkpoints [71.06722933442956] (2022-09-26)
We construct a dataset of neural network checkpoints and train a generative model on the parameters.
We find that our approach successfully generates parameters for a wide range of loss prompts.
We apply our method to different neural network architectures and tasks in supervised and reinforcement learning.
- EvoPruneDeepTL: An Evolutionary Pruning Model for Transfer Learning based Deep Neural Networks [15.29595828816055] (2022-02-08)
We propose an evolutionary pruning model for Transfer Learning based Deep Neural Networks.
EvoPruneDeepTL replaces the last fully-connected layers with sparse layers optimized by a genetic algorithm.
Results show the contribution of EvoPruneDeepTL and feature selection to the overall computational efficiency of the network.
- Non-Gradient Manifold Neural Network [79.44066256794187] (2021-06-15)
Deep neural networks (DNNs) generally take thousands of iterations to optimize via gradient descent.
We propose a novel manifold neural network based on non-gradient optimization.
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613] (2020-03-13)
We take one of the simplest inference methods, a truncated max-product Belief Propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable, and provides parameter-efficient and robust solutions in stereo, optical flow, and semantic segmentation.
- Model Fusion via Optimal Transport [64.13185244219353] (2019-10-12)
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.