Scaling Laws Beyond Backpropagation
- URL: http://arxiv.org/abs/2210.14593v1
- Date: Wed, 26 Oct 2022 10:09:14 GMT
- Title: Scaling Laws Beyond Backpropagation
- Authors: Matthew J. Filipovich, Alessandro Cappelli, Daniel Hesslow, Julien
Launay
- Abstract summary: We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
- Score: 64.0476282000118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alternatives to backpropagation have long been studied to better understand
how biological brains may learn. Recently, they have also garnered interest as
a way to train neural networks more efficiently. By relaxing constraints
inherent to backpropagation (e.g., symmetric feedforward and feedback weights,
sequential updates), these methods enable promising prospects, such as local
learning. However, the tradeoffs between different methods in terms of final
task performance, convergence speed, and ultimately compute and data
requirements are rarely outlined. In this work, we use scaling laws to study
the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only
Transformers efficiently. Scaling laws provide an overview of the tradeoffs
implied by a modeling decision, up to extrapolating how it might transfer to
increasingly large models. We find that DFA fails to offer more efficient
scaling than backpropagation: there is never a regime for which the degradation
in loss incurred by using DFA is worth the potential reduction in compute
budget. Our finding is at variance with previously held beliefs in the
alternative training methods community, and highlights the need for holistic
empirical approaches to better understand modeling decisions.
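For readers unfamiliar with the method being benchmarked: DFA replaces the backward pass of backpropagation with fixed random feedback matrices that project the output error directly to each layer, so every layer's weight update depends only on its own activations and that projected error. The sketch below illustrates the update rule on a tiny two-layer MLP in NumPy; it is a minimal illustration under assumed layer sizes, loss, and nonlinearity, not the causal decoder-only Transformer setup actually trained in the paper.

```python
# Minimal sketch of a Direct Feedback Alignment (DFA) update on a tiny 2-layer MLP.
# Illustrative assumptions: the paper applies DFA to decoder-only Transformers;
# the sizes, softmax cross-entropy loss, and tanh nonlinearity here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, lr = 32, 64, 10, 1e-2

W1 = rng.normal(0.0, d_in ** -0.5, (d_hid, d_in))
W2 = rng.normal(0.0, d_hid ** -0.5, (d_out, d_hid))
B1 = rng.normal(0.0, d_out ** -0.5, (d_hid, d_out))  # fixed random feedback matrix, never trained


def dfa_step(W1, W2, B1, x, y, lr):
    """One DFA update: the output error reaches the hidden layer through the
    fixed random matrix B1 instead of through W2.T as in backpropagation."""
    a1 = W1 @ x                           # hidden pre-activation
    h1 = np.tanh(a1)
    logits = W2 @ h1
    p = np.exp(logits - logits.max())
    p /= p.sum()
    e = p - y                             # softmax-cross-entropy output error

    delta1 = (B1 @ e) * (1.0 - h1 ** 2)   # DFA: random projection of e, times tanh'(a1)

    W2 = W2 - lr * np.outer(e, h1)        # last-layer update, same as backprop
    W1 = W1 - lr * np.outer(delta1, x)    # local update; no dependence on W2
    return W1, W2


x = rng.normal(size=d_in)
y = np.eye(d_out)[3]                      # one-hot target
W1, W2 = dfa_step(W1, W2, B1, x, y, lr)
```

The scaling-law comparison then amounts to fitting how the loss reached by each training rule falls with compute, typically as a power law of the form L(C) ≈ a·C^(−α), for example via a log-log fit such as np.polyfit(np.log(compute), np.log(loss), 1), and checking whether DFA's curve ever crosses below backpropagation's (the exact functional form and fitting protocol used in the paper may differ).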
Related papers
- Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z)
- From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space [13.763716495058294]
Deep Neural Networks are prone to learning spurious correlations embedded in the training data, leading to potentially biased predictions.
This poses risks when deploying these models for high-stakes decision-making, such as in medical applications.
We present a novel method for model correction on the concept level that explicitly reduces model sensitivity towards biases via gradient penalization.
arXiv Detail & Related papers (2023-08-18T10:07:46Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence and makes it more stable and accurate.
Our model uses only 37% of the baseline's network parameters, and its average solution gap to the expert solutions decreases from 6.8% to 1.3%.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- Optimal Decision Diagrams for Classification [68.72078059880018]
We study the training of optimal decision diagrams from a mathematical programming perspective.
We introduce a novel mixed-integer linear programming model for training.
We show how this model can be easily extended for fairness, parsimony, and stability notions.
arXiv Detail & Related papers (2022-05-28T18:31:23Z)
- Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions [20.531576904743282]
Off-policy algorithms such as Tree Backup and Retrace correct off-policy estimation bias in a per-decision manner.
We propose a multistep operator that permits arbitrary past-dependent traces.
arXiv Detail & Related papers (2021-12-23T00:07:28Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing training methods imitate the backpropagation framework and feedforward architectures used for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
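The "extrapolation" in the last entry above is, to our understanding, an extragradient-style step: the gradient is evaluated at a point extrapolated along the current update direction, and that gradient is used for the actual parameter update. The sketch below shows a generic extragradient SGD step as an illustration of the idea only; it is not that paper's specific algorithm, and the function names, step sizes, and single-vector parameterization are assumptions.

```python
# Generic extragradient-style SGD step, sketched to illustrate "extrapolation"
# in large-batch training. Not the cited paper's exact scheme; grad_fn, eta,
# and gamma are assumed names and hyperparameters.
import numpy as np

def extragradient_sgd_step(w, grad_fn, eta=0.1, gamma=0.1):
    """Evaluate the gradient at an extrapolated point, then update w with it."""
    w_look = w - gamma * grad_fn(w)      # extrapolation (lookahead) step
    return w - eta * grad_fn(w_look)     # update uses the extrapolated gradient

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w.
w = np.ones(4)
for _ in range(10):
    w = extragradient_sgd_step(w, grad_fn=lambda v: v)
print(w)
```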
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.