LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters
- URL: http://arxiv.org/abs/2405.16287v1
- Date: Sat, 25 May 2024 15:56:15 GMT
- Title: LoGAH: Predicting 774-Million-Parameter Transformers using Graph HyperNetworks with 1/100 Parameters
- Authors: Xinyu Zhou, Boris Knyazev, Alexia Jolicoeur-Martineau, Jie Fu,
- Abstract summary: Graph HyperNetworks (GHNs) have recently shown strong performance in initializing large vision models.
LoGAH allows us to predict the parameters of neural networks with 774 million parameters in a memory-efficient manner.
- Score: 31.55846326336193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A good initialization of deep learning models is essential since it can help them converge better and faster. However, pretraining large models is unaffordable for many researchers, which makes predicting good initial parameters increasingly necessary. Graph HyperNetworks (GHNs), one approach to predicting model parameters, have recently shown strong performance in initializing large vision models. Unfortunately, predicting parameters of very wide networks relies on copying small chunks of parameters multiple times and requires an extremely large number of parameters to support full prediction, which greatly hinders its adoption in practice. To address this limitation, we propose LoGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring as excessive an increase in parameters as in previous attempts. LoGAH allows us to predict the parameters of 774-million-parameter neural networks in a memory-efficient manner. We show that vision and language models (i.e., ViT and GPT-2) initialized with LoGAH achieve better performance than those initialized randomly or using existing hypernetworks. Furthermore, we show promising transfer learning results w.r.t. training LoGAH on small datasets and using the predicted parameters to initialize for larger tasks. The code is available at https://github.com/Blackzxy/LoGAH .
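To make the idea of a low-rank parameter decoder concrete, here is a minimal sketch of why emitting two thin factors is cheaper than emitting a full weight matrix. It is not taken from the LoGAH code: the function names, the single linear decoder, and the toy sizes are all assumptions chosen for illustration, and the actual decoder in the paper may be structured differently.

```python
import random

# Hypothetical helpers for illustration only; not the LoGAH implementation.

def full_decoder_size(hidden, d_out, d_in):
    """Weights in a linear decoder that maps a hidden vector to a full d_out x d_in matrix."""
    return hidden * d_out * d_in

def low_rank_decoder_size(hidden, d_out, d_in, rank):
    """Weights in two linear decoders that emit factors A (d_out x rank) and B (rank x d_in)."""
    return hidden * rank * (d_out + d_in)

def linear(vec, matrix):
    """Multiply a length-h vector by an h x n matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]

def predict_weight(embedding, proj_a, proj_b, d_out, d_in, rank):
    """Decode two low-rank factors from an embedding and return their product A @ B."""
    flat_a = linear(embedding, proj_a)  # length d_out * rank
    flat_b = linear(embedding, proj_b)  # length rank * d_in
    a = [flat_a[i * rank:(i + 1) * rank] for i in range(d_out)]
    b = [flat_b[k * d_in:(k + 1) * d_in] for k in range(rank)]
    return [[sum(a[i][k] * b[k][j] for k in range(rank)) for j in range(d_in)]
            for i in range(d_out)]

# Toy sizes (assumed); real models are far wider.
hidden, d_out, d_in, rank = 8, 32, 32, 2
rng = random.Random(0)
embedding = [rng.gauss(0, 1) for _ in range(hidden)]
proj_a = [[rng.gauss(0, 0.02) for _ in range(d_out * rank)] for _ in range(hidden)]
proj_b = [[rng.gauss(0, 0.02) for _ in range(rank * d_in)] for _ in range(hidden)]

weight = predict_weight(embedding, proj_a, proj_b, d_out, d_in, rank)
full_size = full_decoder_size(hidden, d_out, d_in)
low_rank_size = low_rank_decoder_size(hidden, d_out, d_in, rank)
print(len(weight), len(weight[0]), full_size, low_rank_size)
```

In this sketch the full decoder grows with the product of the two layer dimensions, while the low-rank decoder grows with their sum times the rank, which is the kind of saving that lets a hypernetwork target wider layers without chunk copying.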
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Recurrent Diffusion for Large-Scale Parameter Generation [52.98888368644455]
We introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters up to hundreds of millions on a single GPU.
RPG serves as a critical advance in AI generating AI, potentially enabling efficient weight generation at scales previously deemed infeasible.
arXiv Detail & Related papers (2025-01-20T16:46:26Z)
- Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough? [11.43983519639935]
Many machine learning models require setting a parameter that controls their size before training.
This leads to the question "How big is big enough?"
Here, data becomes available incrementally, and the final dataset size will therefore not be known before training.
We develop a method to automatically adjust model size while maintaining near optimal performance.
arXiv Detail & Related papers (2024-08-14T14:40:00Z)
- Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers and four parameterizations.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z)
- ParameterNet: Parameters Are All You Need [50.150436250355945]
We introduce a novel design principle, termed ParameterNet, aimed at augmenting the number of parameters in large-scale visual pretraining models.
We leverage dynamic convolutions to incorporate additional parameters into the networks with only a marginal rise in FLOPs.
The ParameterNet approach allows low-FLOPs networks to take advantage of large-scale visual pretraining.
arXiv Detail & Related papers (2023-06-26T09:01:35Z)
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
- Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models? [23.668513148189344]
We release a single neural network that can predict high quality parameters of other neural networks.
We are able to boost training of diverse ImageNet models available in PyTorch.
When transferred to other datasets, models with predicted parameters also converge faster and reach competitive final performance.
arXiv Detail & Related papers (2023-03-07T18:56:59Z)
- Learning to Learn with Generative Models of Neural Network Checkpoints [71.06722933442956]
We construct a dataset of neural network checkpoints and train a generative model on the parameters.
We find that our approach successfully generates parameters for a wide range of loss prompts.
We apply our method to different neural network architectures and tasks in supervised and reinforcement learning.
arXiv Detail & Related papers (2022-09-26T17:59:58Z)
- Pretraining a Neural Network before Knowing Its Architecture [2.170169149901781]
Training large neural networks is possible by training a smaller hypernetwork that predicts parameters for the large ones.
A recently released Graph HyperNetwork (GHN) trained this way on one million smaller ImageNet architectures is able to predict parameters for large unseen networks such as ResNet-50.
While networks with predicted parameters lose performance on the source task, the predicted parameters have been found useful for fine-tuning on other tasks.
arXiv Detail & Related papers (2022-07-20T17:27:50Z)