Scaling Laws Beyond Backpropagation
- URL: http://arxiv.org/abs/2210.14593v1
- Date: Wed, 26 Oct 2022 10:09:14 GMT
- Title: Scaling Laws Beyond Backpropagation
- Authors: Matthew J. Filipovich, Alessandro Cappelli, Daniel Hesslow, Julien
Launay
- Abstract summary: We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
- Score: 64.0476282000118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alternatives to backpropagation have long been studied to better understand
how biological brains may learn. Recently, they have also garnered interest as
a way to train neural networks more efficiently. By relaxing constraints
inherent to backpropagation (e.g., symmetric feedforward and feedback weights,
sequential updates), these methods enable promising prospects, such as local
learning. However, the tradeoffs between different methods in terms of final
task performance, convergence speed, and ultimately compute and data
requirements are rarely outlined. In this work, we use scaling laws to study
the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only
Transformers efficiently. Scaling laws provide an overview of the tradeoffs
implied by a modeling decision, up to extrapolating how it might transfer to
increasingly large models. We find that DFA fails to offer more efficient
scaling than backpropagation: there is never a regime for which the degradation
in loss incurred by using DFA is worth the potential reduction in compute
budget. Our finding is at variance with previously held beliefs in the
alternative training methods community, and highlights the need for holistic
empirical approaches to better understand modeling decisions.
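For readers unfamiliar with the method being benchmarked: DFA replaces the backward pass of backpropagation with fixed random feedback matrices that project the output error directly to each layer, so every layer's weight update depends only on its own activations and that projected error. The sketch below illustrates the update rule on a tiny two-layer MLP in NumPy; it is a minimal illustration under assumed layer sizes, loss, and nonlinearity, not the causal decoder-only Transformer setup actually trained in the paper.

```python
# Minimal sketch of a Direct Feedback Alignment (DFA) update on a tiny 2-layer MLP.
# Illustrative assumptions: the paper applies DFA to decoder-only Transformers;
# the sizes, softmax cross-entropy loss, and tanh nonlinearity here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, lr = 32, 64, 10, 1e-2

W1 = rng.normal(0.0, d_in ** -0.5, (d_hid, d_in))
W2 = rng.normal(0.0, d_hid ** -0.5, (d_out, d_hid))
B1 = rng.normal(0.0, d_out ** -0.5, (d_hid, d_out))  # fixed random feedback matrix, never trained


def dfa_step(W1, W2, B1, x, y, lr):
    """One DFA update: the output error reaches the hidden layer through the
    fixed random matrix B1 instead of through W2.T as in backpropagation."""
    a1 = W1 @ x                           # hidden pre-activation
    h1 = np.tanh(a1)
    logits = W2 @ h1
    p = np.exp(logits - logits.max())
    p /= p.sum()
    e = p - y                             # softmax-cross-entropy output error

    delta1 = (B1 @ e) * (1.0 - h1 ** 2)   # DFA: random projection of e, times tanh'(a1)

    W2 = W2 - lr * np.outer(e, h1)        # last-layer update, same as backprop
    W1 = W1 - lr * np.outer(delta1, x)    # local update; no dependence on W2
    return W1, W2


x = rng.normal(size=d_in)
y = np.eye(d_out)[3]                      # one-hot target
W1, W2 = dfa_step(W1, W2, B1, x, y, lr)
```

The scaling-law comparison then amounts to fitting how the loss reached by each training rule falls with compute, typically as a power law of the form L(C) ≈ a·C^(−α), for example via a log-log fit such as np.polyfit(np.log(compute), np.log(loss), 1), and checking whether DFA's curve ever crosses below backpropagation's (the exact functional form and fitting protocol used in the paper may differ).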
Related papers
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space [13.763716495058294]
Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions.
This poses risks when deploying these models for high-stakes decision-making, such as in medical applications.
We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization.
arXiv Detail & Related papers (2023-08-18T10:07:46Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence and makes it more stable and accurate.
Our model uses only 37% of the baseline's network parameters, and its average solution gap to the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Optimal Decision Diagrams for Classification [68.72078059880018]
We study the training of optimal decision diagrams from a mathematical programming perspective.
We introduce a novel mixed-integer linear programming model for training.
We show how this model can be easily extended for fairness, parsimony, and stability notions.
arXiv Detail & Related papers (2022-05-28T18:31:23Z)
- Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions [20.531576904743282]
Off-policy algorithms such as Tree Backup and Retrace correct off-policy estimation bias in a per-decision manner.
We propose a multistep operator that permits arbitrary past-dependent traces.
arXiv Detail & Related papers (2021-12-23T00:07:28Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing training methods imitate the backpropagation framework and feedforward architectures used for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
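The "extrapolation" in the last entry above is, to our understanding, an extragradient-style step: the gradient is evaluated at a point extrapolated along the current update direction, and that gradient is used for the actual parameter update. The sketch below shows a generic extragradient SGD step as an illustration of the idea only; it is not that paper's specific algorithm, and the function names, step sizes, and single-vector parameterization are assumptions.

```python
# Generic extragradient-style SGD step, sketched to illustrate "extrapolation"
# in large-batch training. Not the cited paper's exact scheme; grad_fn, eta,
# and gamma are assumed names and hyperparameters.
import numpy as np

def extragradient_sgd_step(w, grad_fn, eta=0.1, gamma=0.1):
    """Evaluate the gradient at an extrapolated point, then update w with it."""
    w_look = w - gamma * grad_fn(w)      # extrapolation (lookahead) step
    return w - eta * grad_fn(w_look)     # update uses the extrapolated gradient

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w.
w = np.ones(4)
for _ in range(10):
    w = extragradient_sgd_step(w, grad_fn=lambda v: v)
print(w)
```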
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.