Dropout Inference with Non-Uniform Weight Scaling
- URL: http://arxiv.org/abs/2204.13047v1
- Date: Wed, 27 Apr 2022 16:41:12 GMT
- Title: Dropout Inference with Non-Uniform Weight Scaling
- Authors: Zhaoyuan Yang and Arpit Jain
- Abstract summary: Dropout as regularization has been used extensively to prevent overfitting for training neural networks.
In this work, we demonstrate scenarios where some submodels behave closer to high-bias models and a non-uniform weight scaling is a better approximation for inference.
- Score: 6.726255259929496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dropout as regularization has been used extensively to prevent overfitting
for training neural networks. During training, units and their connections are
randomly dropped, which could be considered as sampling many different
submodels from the original model. At test time, weight scaling and Monte Carlo
approximation are two widely applied approaches to approximate the outputs.
Both approaches work well practically when all submodels are low-bias complex
learners. However, in this work, we demonstrate scenarios where some submodels
behave closer to high-bias models and a non-uniform weight scaling is a better
approximation for inference.
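To make the two standard approximations concrete, here is a minimal NumPy sketch (not taken from the paper) of dropout inference on a single hidden layer: uniform weight scaling, Monte Carlo averaging over sampled masks, and an illustrative non-uniform per-unit scaling. The layer sizes, keep probability, and the particular non-uniform scale vector are placeholders; how such per-unit scales should actually be chosen is what the paper studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer network; sizes and keep probability are illustrative.
d_in, d_hidden, d_out, p_keep = 8, 16, 2, 0.5
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_out))
x = rng.normal(size=d_in)

def forward(mask):
    """One dropout submodel: hidden units are scaled (or zeroed) by `mask`."""
    h = np.maximum(x @ W1, 0.0) * mask
    return h @ W2

# 1) Weight scaling: keep every unit and scale by the keep probability,
#    i.e. replace the random mask with its expectation E[m] = p_keep.
weight_scaled = forward(np.full(d_hidden, p_keep))

# 2) Monte Carlo approximation: average the outputs of many sampled submodels.
masks = rng.binomial(1, p_keep, size=(1000, d_hidden))
monte_carlo = np.mean([forward(m) for m in masks], axis=0)

# 3) Non-uniform scaling (the idea argued for above): give each unit its own
#    scale instead of one shared p_keep. These per-unit values are invented
#    purely for illustration.
per_unit_scale = rng.uniform(0.3, 0.7, size=d_hidden)
non_uniform = forward(per_unit_scale)

print(weight_scaled, monte_carlo, non_uniform, sep="\n")
```

With a shared keep probability, approach (1) simply evaluates the expected mask of approach (2); the non-uniform variant replaces that single shared constant with a per-unit scale, which is where the paper argues the uniform rule can break down.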
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
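As a rough illustration of the workflow that entry describes (fit a scaling law on cheap runs, then extrapolate to a larger model), the sketch below fits a pure power law L(N) = a * N^(-b) in log-log space. The parameter counts and losses are synthetic, and the paper's actual functional forms and fitting procedures may differ.

```python
import numpy as np

# Synthetic (invented) losses observed for small models at various parameter counts.
n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses   = np.array([4.10, 3.62, 3.21, 2.95, 2.74])

# Fit a pure power law L(N) = a * N^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate to a larger model that was never trained.
n_target = 1e9
print(f"predicted loss at 1B params: {a * n_target ** (-b):.2f}")
```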
- Weight Scope Alignment: A Frustratingly Easy Method for Model Merging [40.080926444789085]
Non-I.I.D. data poses a huge challenge for averaging-based model fusion.
In this paper, we reveal variations in weight scope under different training conditions, shedding light on its influence on model merging.
Fortunately, the parameters in each layer basically follow a Gaussian distribution, which inspires a novel and simple regularization approach.
arXiv Detail & Related papers (2024-08-22T09:13:27Z)
- Scalable Ensembling For Mitigating Reward Overoptimisation [24.58937616758007]
Reinforcement Learning from Human Feedback has enabled significant advancements within language modeling for powerful, instruction-following models.
The alignment of these models remains a pressing challenge as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility.
arXiv Detail & Related papers (2024-06-03T05:46:53Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that there is no single model that works best in all cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
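A minimal sketch of the uniform-soup recipe summarized in the entry above, assuming PyTorch checkpoints that share one architecture; the checkpoint file names in the usage comment are hypothetical, and the paper also explores more selective recipes such as a greedy soup.

```python
import torch

def uniform_soup(state_dicts):
    """Average the parameters of several fine-tuned checkpoints element-wise.

    All checkpoints are assumed to share the same architecture and keys.
    """
    soup = {}
    for key in state_dicts[0]:
        soup[key] = torch.mean(
            torch.stack([sd[key].float() for sd in state_dicts]), dim=0
        )
    return soup

# Hypothetical usage with checkpoints fine-tuned under different hyperparameters:
# paths = ["ft_lr1e-5.pt", "ft_lr3e-5.pt", "ft_lr1e-4.pt"]
# state_dicts = [torch.load(p, map_location="cpu") for p in paths]
# model.load_state_dict(uniform_soup(state_dicts))
```

Because the averaging happens once, offline, the merged model costs exactly as much to run as any single fine-tuned checkpoint, which is the "without increasing inference time" part of the claim.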
- Optimal Model Averaging: Towards Personalized Collaborative Learning [0.0]
In federated learning, differences in the data or objectives between the participating nodes motivate approaches to train a personalized machine learning model for each node.
One such approach is weighted averaging between a locally trained model and the global model.
We find that there is always some positive amount of model averaging that reduces the expected squared error compared to the local model.
arXiv Detail & Related papers (2021-10-25T13:33:20Z)
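A small sketch of the weighted local/global averaging described above; the NumPy parameter vectors and the mixing weight `alpha` are placeholders, and choosing the amount of averaging optimally is exactly what the paper analyzes.

```python
import numpy as np

def personalized_model(local_params, global_params, alpha):
    """Interpolate between a node's locally trained model and the global model.

    alpha = 0 recovers the purely local model, alpha = 1 the global model.
    The entry's claim is that some alpha > 0 always lowers expected squared
    error relative to alpha = 0.
    """
    return (1.0 - alpha) * local_params + alpha * global_params

# Toy example with flat parameter vectors (placeholders, not real models).
local_w  = np.array([0.9, 1.4, -0.2])
global_w = np.array([1.0, 1.0,  0.0])
print(personalized_model(local_w, global_w, alpha=0.3))
```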
- No One Representation to Rule Them All: Overlapping Features of Training Methods [12.58238785151714]
High-performing models tend to make similar predictions regardless of training methodology.
Recent work shows that very different training techniques, such as large-scale contrastive learning, can yield competitively high accuracy.
We show that these models specialize in different ways of generalizing the data, leading to higher ensemble performance.
arXiv Detail & Related papers (2021-10-20T21:29:49Z)
- When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
arXiv Detail & Related papers (2021-10-19T22:56:49Z)
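A rough sketch of the two-stage idea in the entry above: a lightweight student answers the "easy" inputs it is confident about and defers the rest to the large teacher. The model objects, the confidence rule, and the 0.9 threshold are placeholders, not the paper's actual routing criterion.

```python
import numpy as np

def two_stage_predict(x, student, teacher, threshold=0.9):
    """Route an input: answer with the cheap student when it is confident,
    otherwise fall back to the expensive teacher.

    `student` and `teacher` are any callables returning class probabilities;
    the 0.9 threshold is an arbitrary placeholder.
    """
    probs = student(x)
    if np.max(probs) >= threshold:
        return int(np.argmax(probs)), "student"
    return int(np.argmax(teacher(x))), "teacher"

# Toy stand-ins for the two models.
student = lambda x: np.array([0.55, 0.45])   # unsure -> defer to teacher
teacher = lambda x: np.array([0.10, 0.90])
print(two_stage_predict(None, student, teacher))  # -> (1, 'teacher')
```

The amortized inference cost then depends on how often the student handles an input on its own, which is why a more aggressive (smaller) student becomes viable.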
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This suggests that output diversity in ensembling can often be a more efficient route than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
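To make that last entry concrete, here is a sketch of probability averaging over a few small classifiers; the toy models are placeholders, and the entry's actual accuracy-versus-FLOPs comparison of course depends on the real architectures being ensembled.

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the predicted class probabilities of several small models.

    The entry's claim is that such an ensemble can beat one large model on
    accuracy while using fewer total FLOPs; the models here are placeholders.
    """
    probs = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(probs)), probs

# Toy stand-ins for three small classifiers.
small_models = [
    lambda x: np.array([0.6, 0.4]),
    lambda x: np.array([0.3, 0.7]),
    lambda x: np.array([0.2, 0.8]),
]
print(ensemble_predict(small_models, x=None))  # -> (1, averaged probabilities)
```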
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.