Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct
Feedback Alignment
- URL: http://arxiv.org/abs/2012.06373v1
- Date: Fri, 11 Dec 2020 14:20:45 GMT
- Title: Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct
Feedback Alignment
- Authors: Julien Launay, Iacopo Poli, Kilian Müller, Gustave Pariente, Igor
Carron, Laurent Daudet, Florent Krzakala, Sylvain Gigan
- Abstract summary: We present a photonic accelerator for Direct Feedback Alignment, able to compute random projections with trillions of parameters.
This is a significant step towards building scalable hardware, able to go beyond backpropagation.
- Score: 26.65651157173834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling hypothesis motivates the expansion of models past trillions of
parameters as a path towards better performance. Recent significant
developments, such as GPT-3, have been driven by this conjecture. However, as
models scale up, training them efficiently with backpropagation becomes
difficult. Because model, pipeline, and data parallelism distribute parameters
and gradients over compute nodes, communication is challenging to orchestrate:
this is a bottleneck to further scaling. In this work, we argue that
alternative training methods can mitigate these issues, and can inform the
design of extreme-scale training hardware. Indeed, using a synaptically
asymmetric method with a parallelizable backward pass, such as Direct Feedback
Alignment, communication needs are drastically reduced. We present a photonic
accelerator for Direct Feedback Alignment, able to compute random projections
with trillions of parameters. We demonstrate our system on benchmark tasks,
using both fully-connected and graph convolutional networks. Our hardware is
the first architecture-agnostic photonic co-processor for training neural
networks. This is a significant step towards building scalable hardware, able
to go beyond backpropagation, and opening new avenues for deep learning.
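As a concrete illustration of the training scheme described above, the following is a minimal NumPy sketch of Direct Feedback Alignment on a small fully-connected classifier. It is not the paper's implementation: the data, layer sizes, and learning rate are illustrative assumptions, and the fixed random feedback matrices B1 and B2 stand in for the random projection that the photonic co-processor computes optically at scale. The key point is that each hidden layer's update depends only on the global output error and its local activations, so the backward pass needs neither the transposed forward weights nor layer-to-layer gradient communication.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 256 samples, 30 features, 10 classes (illustrative, not from the paper).
X = rng.normal(size=(256, 30))
y = rng.integers(0, 10, size=256)
Y = np.eye(10)[y]                          # one-hot targets

# Forward weights: two hidden layers plus an output layer.
W1 = rng.normal(size=(30, 64)) * 0.1
W2 = rng.normal(size=(64, 64)) * 0.1
W3 = rng.normal(size=(64, 10)) * 0.1

# Fixed random feedback matrices: they project the output error directly to
# each hidden layer. This random projection is the operation a photonic
# co-processor can perform at very large scale.
B1 = rng.normal(size=(10, 64))
B2 = rng.normal(size=(10, 64))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.05
for step in range(200):
    # Forward pass with tanh hidden units.
    a1 = np.tanh(X @ W1)
    a2 = np.tanh(a1 @ W2)
    p = softmax(a2 @ W3)

    # Global error at the output (softmax cross-entropy).
    e = (p - Y) / len(X)

    # DFA backward pass: each hidden layer receives the error through its own
    # fixed random matrix instead of the transposed forward weights.
    d2 = (e @ B2) * (1.0 - a2 ** 2)
    d1 = (e @ B1) * (1.0 - a1 ** 2)

    # Local, mutually independent weight updates.
    W3 -= lr * a2.T @ e
    W2 -= lr * a1.T @ d2
    W1 -= lr * X.T @ d1
```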
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training large models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
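A much-simplified sketch of the general pattern summarized in the entry above, not the paper's actual architecture: a frozen backbone (here just a fixed random projection) produces features that can be computed once and cached, and only a lightweight head on those features is trained, so no gradients ever flow through the backbone. All names, shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" backbone, reduced here to one fixed random projection.
W_backbone = rng.normal(size=(128, 256)) * 0.05   # never updated

def backbone(x):
    # Frozen feature extractor: no parameters of this function are trained.
    return np.maximum(x @ W_backbone, 0.0)

# Lightweight trainable head operating on backbone features.
W_head = np.zeros((256, 10))

X = rng.normal(size=(512, 128))
y = rng.integers(0, 10, size=512)
Y = np.eye(10)[y]

# Since the backbone never changes, its features are computed once and cached;
# this is where the time and memory savings of not backpropagating through the
# backbone come from.
feats = backbone(X)

lr = 0.1
for step in range(100):
    logits = feats @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = feats.T @ (p - Y) / len(X)             # gradient w.r.t. the head only
    W_head -= lr * grad
```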
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for large-scale distributed training.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Biologically Plausible Learning on Neuromorphic Hardware Architectures [27.138481022472]
Neuromorphic computing is an emerging paradigm that confronts this imbalance by performing computations directly in analog memories.
This work is the first to compare the impact of different learning algorithms on Compute-In-Memory-based hardware and vice versa.
arXiv Detail & Related papers (2022-12-29T15:10:59Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
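To make the equilibrium-network entry above concrete, here is a minimal NumPy sketch of the forward pass of a deep equilibrium model: a single nonlinear layer f(z, x) = tanh(Wz + Ux + b) is iterated until z stops changing, and that fixed point is the network's output. The shapes, the 0.9 spectral rescaling, and the naive fixed-point loop are illustrative assumptions, not taken from the paper; training such models additionally relies on implicit differentiation through the fixed point, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single implicit "layer": f(z, x) = tanh(W z + U x + b).
d, m = 32, 16
W = rng.normal(size=(d, d))
W *= 0.9 / np.linalg.norm(W, 2)   # spectral norm < 1 keeps the map contractive
U = rng.normal(size=(d, m)) * 0.1
b = np.zeros(d)

def f(z, x):
    return np.tanh(W @ z + U @ x + b)

def deq_forward(x, tol=1e-6, max_iter=500):
    """Output of the equilibrium model: the fixed point z* = f(z*, x)."""
    z = np.zeros(d)
    for _ in range(max_iter):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

x = rng.normal(size=m)
z_star = deq_forward(x)
print(np.linalg.norm(z_star - f(z_star, x)))   # ~0: z_star is the fixed point
```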
- Gradient Forward-Propagation for Large-Scale Temporal Video Modelling [13.665160620951777]
Backpropagation blocks computations until the forward and backward passes are completed.
For temporal signals, this introduces high latency and hinders real-time learning.
In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time.
We show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training.
arXiv Detail & Related papers (2021-06-15T17:50:22Z)
- GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
arXiv Detail & Related papers (2021-02-16T11:45:35Z)
- One-step regression and classification with crosspoint resistive memory arrays [62.997667081978825]
High speed, low energy computing machines are in demand to enable real-time artificial intelligence at the edge.
One-step learning is supported by simulations of Boston house-price prediction and of the training of a 2-layer neural network for MNIST digit recognition.
Results are all obtained in one computational step, thanks to the physical, parallel, and analog computing within the crosspoint array.
arXiv Detail & Related papers (2020-05-05T08:00:07Z)
- Pipelined Backpropagation at Scale: Training Large Models without Batches [0.9580895202050946]
We evaluate the use of small-batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline-parallel training algorithm.
We show that appropriate normalization and small batch sizes can also aid training.
arXiv Detail & Related papers (2020-03-25T22:26:28Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)