Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
- URL: http://arxiv.org/abs/2207.10551v1
- Date: Thu, 21 Jul 2022 15:50:22 GMT
- Title: Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
- Authors: Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William
Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald
Metzler
- Abstract summary: This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures.
We show that architecture is an important consideration when scaling models and that the best-performing model can fluctuate at different scales.
- Score: 91.78878523252897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a lot of interest in the scaling properties of Transformer
models. However, not much has been done to investigate how different inductive
biases and model architectures affect these scaling properties. Do model
architectures scale differently? If so, how does inductive bias affect scaling
behaviour? How does this influence upstream (pretraining) and downstream (transfer)
performance? This paper conducts a systematic study of the scaling behaviour of ten
diverse model architectures such as Transformers, Switch Transformers, Universal
Transformers, Dynamic convolutions, Performers, and the recently proposed
MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an
important consideration when performing scaling and (2) the best performing model
can fluctuate at different scales. We believe that the findings outlined in this
work have significant implications for how model architectures are currently
evaluated in the community.
Related papers
- Setting the Record Straight on Transformer Oversmoothing [35.125957267464756]
As model depth increases, Transformers oversmooth, i.e., inputs become more and more similar.
We show that smoothing behavior depends on the eigenspectrum of the value and projection weights.
Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior.
arXiv Detail & Related papers (2024-01-09T01:19:03Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study the impact of different modeling practices on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- The Lie Derivative for Measuring Learned Equivariance [84.29366874540217]
We study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures.
We find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities.
Notably, transformers can be more equivariant than convolutional neural networks after training.
arXiv Detail & Related papers (2022-10-06T15:20:55Z)
- Understanding Scaling Laws for Recommendation Models [1.6283945233720964]
We study empirical scaling laws for DLRM-style recommendation models, in particular Click-Through Rate (CTR) models.
We characterize scaling efficiency along three different resource dimensions, namely data, parameters and compute.
We show that parameter scaling has run out of steam for the model architecture under study; until a higher-performing model architecture emerges, data scaling is the path forward (see the data-scaling sketch after this list).
arXiv Detail & Related papers (2022-08-17T19:13:17Z)
- What do Toothbrushes do in the Kitchen? How Transformers Think our World is Structured [137.83584233680116]
We investigate to what extent transformer-based language models allow for extracting knowledge about object relations.
We show that the models, combined with different similarity measures, differ greatly in the amount of knowledge they allow to be extracted.
Surprisingly, static models perform almost as well as contextualized models -- in some cases even better.
arXiv Detail & Related papers (2022-04-12T10:00:20Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, aside from model size, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
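The recommendation-model and NMT entries above both reason in terms of data-scaling exponents. As a minimal sketch of what such an exponent means, the snippet below fits an assumed saturating power law L(D) = a * D^(-p) + c to made-up (dataset size, loss) pairs and extrapolates the data needed for a target loss; neither the numbers nor the functional form come from those papers.

```python
# Illustrative sketch of a data-scaling law L(D) = a * D**(-p) + c, the kind
# of fit behind a "data scaling exponent" p. All numbers are made-up placeholders.
import numpy as np
from scipy.optimize import curve_fit

def data_scaling_law(d, a, p, c):
    # Loss as a function of dataset size d, saturating at irreducible loss c.
    return a * np.power(d, -p) + c

# Hypothetical (dataset size in examples, eval loss) measurements.
d = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([2.80, 2.55, 2.35, 2.22, 2.13])

(a, p, c), _ = curve_fit(data_scaling_law, d, loss, p0=(50.0, 0.3, 2.0), maxfev=20000)
print(f"fitted exponent p = {p:.3f}, irreducible loss c = {c:.3f}")

# Extrapolate: how much more data for a target loss? Invert the fitted law
# (only meaningful when the target sits above the fitted irreducible loss c).
target = 2.10
required_d = (a / (target - c)) ** (1.0 / p)
print(f"estimated data needed for loss {target}: {required_d:.2e} examples")
```

A larger fitted p means extra data pays off faster, which is the sense in which "marginally worse architectures or training data can be compensated for by adding more data" when the exponent is barely affected.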