Caffe Barista: Brewing Caffe with FPGAs in the Training Loop
- URL: http://arxiv.org/abs/2006.13829v1
- Date: Thu, 18 Jun 2020 17:56:12 GMT
- Title: Caffe Barista: Brewing Caffe with FPGAs in the Training Loop
- Authors: Diederik Adriaan Vink, Aditya Rajagopal, Stylianos I. Venieris,
Christos-Savvas Bouganis
- Abstract summary: Barista is an automated toolflow that provides seamless integration of FPGAs into the training of Convolutional Neural Networks (CNNs) within the popular deep learning framework Caffe.
- Score: 13.83645579871775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the complexity of deep learning (DL) models increases, their compute
requirements increase accordingly. Deploying a Convolutional Neural Network
(CNN) involves two phases: training and inference. With the inference task
typically taking place on resource-constrained devices, a lot of research has
explored the field of low-power inference on custom hardware accelerators. On
the other hand, training is both more compute- and memory-intensive and is
primarily performed on power-hungry GPUs in large-scale data centres. CNN
training on FPGAs is a nascent field of research. This is primarily due to the
lack of tools to easily prototype and deploy various hardware and/or
algorithmic techniques for power-efficient CNN training. This work presents
Barista, an automated toolflow that provides seamless integration of FPGAs into
the training of CNNs within the popular deep learning framework Caffe. To the
best of our knowledge, this is the only tool that allows for such versatile and
rapid deployment of hardware and algorithms for the FPGA-based training of
CNNs, providing the necessary infrastructure for further research and
development.
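As a concrete illustration of the integration point such a toolflow needs, the sketch below uses Caffe's stock Python-layer mechanism, which lets a layer's forward and backward passes be routed to custom compute during training. This is a minimal sketch, not Barista's actual interface: the FPGAConvLayer name and its identity compute (a stand-in for dispatching work to an FPGA) are hypothetical, and it assumes pycaffe is installed.

    import caffe

    class FPGAConvLayer(caffe.Layer):
        """Hypothetical training-time layer whose compute would be offloaded."""

        def setup(self, bottom, top):
            # A real toolflow would program the FPGA bitstream and open the
            # device here; nothing is needed for this stand-in.
            pass

        def reshape(self, bottom, top):
            # Output takes the input's shape (identity stand-in).
            top[0].reshape(*bottom[0].data.shape)

        def forward(self, bottom, top):
            # Stand-in for an FPGA-executed forward pass: identity copy.
            top[0].data[...] = bottom[0].data

        def backward(self, top, propagate_down, bottom):
            # Stand-in for an FPGA-executed backward pass: gradients pass through.
            if propagate_down[0]:
                bottom[0].diff[...] = top[0].diff

In a training prototxt such a layer would be referenced as type "Python" with python_param { module: "fpga_layer" layer: "FPGAConvLayer" }, so Caffe's existing solver loop drives the offloaded layer unchanged.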
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Harnessing FPGA Technology for Enhanced Biomedical Computation [0.0]
This research delves into sophisticated neural network frameworks such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Deep Belief Networks (DBNs).
By evaluating performance indicators like latency and throughput, we showcase the efficacy of FPGAs in advanced biomedical computing.
arXiv Detail & Related papers (2023-11-21T08:51:58Z)
- Transferability of Convolutional Neural Networks in Stationary Learning Tasks [96.00428692404354]
We introduce a novel framework for efficient training of convolutional neural networks (CNNs) for large-scale spatial problems.
We show that a CNN trained on small windows of such signals achieves nearly identical performance on much larger windows without retraining.
Our results show that the CNN can tackle problems with many hundreds of agents after being trained with fewer than ten.
arXiv Detail & Related papers (2023-07-21T13:51:45Z)
- Exploiting FPGA Capabilities for Accelerated Biomedical Computing [0.0]
This study presents advanced neural network architectures for enhanced ECG signal analysis using Field Programmable Gate Arrays (FPGAs).
We utilize the MIT-BIH Arrhythmia Database for training and validation, introducing Gaussian noise to improve robustness.
The study ultimately offers a guide for optimizing neural network performance on FPGAs for various applications.
arXiv Detail & Related papers (2023-07-16T01:20:17Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- A Comprehensive Survey on Distributed Training of Graph Neural Networks [59.785830738482474]
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields.
To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training.
The volume of related research on distributed GNN training is vast, and new work is published at a rapid pace.
arXiv Detail & Related papers (2022-11-10T06:22:12Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total.
Compared to its full-precision software counterpart, it reduces classification time by three orders of magnitude with a small 4.5% impact on accuracy.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires substantial computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler (a toy sketch of such a sampler follows this list).
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
- Multi-node Bert-pretraining: Cost-efficient Approach [6.5998084177955425]
Large scale Transformer-based language models have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks.
With the advent of large-scale unsupervised datasets, training time is further extended due to the increased amount of data samples within a single training epoch.
We show that we are able to perform pre-training on BERT within a reasonable time budget (12 days) in an academic setting.
arXiv Detail & Related papers (2020-08-01T05:49:20Z)
- CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA [0.3655021726150368]
This paper introduces an integrated framework that supports compilation of a CNN model for an FPGA target.
CNN2Gate exploits the OpenCL synthesis workflow for FPGAs offered by commercial vendors.
This paper reports results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms.
arXiv Detail & Related papers (2020-04-06T01:57:53Z)
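The sketch below, referenced from the sampling-and-pipelining entry above, shows what fixed-fanout neighborhood sampling for mini-batch GNN training looks like in its simplest form. It is a toy illustration only: the function name, the fanout scheme, and the example graph are assumptions, not that paper's performance-engineered sampler.

    import random

    def sample_neighborhood(adj, seeds, fanouts, rng=random.Random(0)):
        """Sample up to `fanout` neighbors per frontier node at each hop.

        adj: dict mapping node -> list of neighbor nodes
        seeds: nodes in the current mini-batch
        fanouts: per-hop sample sizes, outermost hop first
        Returns one list of (src, dst) message edges per hop.
        """
        layers = []
        frontier = list(seeds)
        for fanout in fanouts:
            edges = []
            next_frontier = set()
            for v in frontier:
                nbrs = adj.get(v, [])
                picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
                for u in picked:
                    edges.append((u, v))  # message flows u -> v
                    next_frontier.add(u)
            layers.append(edges)
            frontier = list(next_frontier)
        return layers

    # Toy usage: a two-hop sample around node 0 on a 5-node graph.
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2], 4: [2]}
    print(sample_neighborhood(adj, seeds=[0], fanouts=[2, 2]))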