EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data
Reshaping for Online Adaptation or Personalization
- URL: http://arxiv.org/abs/2202.10935v1
- Date: Fri, 18 Feb 2022 18:27:42 GMT
- Title: EF-Train: Enable Efficient On-device CNN Training on FPGA Through Data
Reshaping for Online Adaptation or Personalization
- Authors: Yue Tang, Xinyi Zhang, Peipei Zhou, Jingtong Hu
- Abstract summary: EF-Train is an efficient DNN training accelerator with a unified channel-level parallelism-based convolution kernel.
It can achieve end-to-end training on resource-limited low-power edge-level FPGAs.
Our design achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09 GFLOPS/W.
- Score: 11.44696439060875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventionally, DNN models are trained once in the cloud and deployed in edge
devices such as cars, robots, or unmanned aerial vehicles (UAVs) for real-time
inference. However, many scenarios require the models to adapt to new
environments, domains, or users. To realize such domain adaptation or
personalization, the models need to be continuously trained on the device. In
this work, we design EF-Train, an efficient DNN
training accelerator with a unified channel-level parallelism-based convolution
kernel that can achieve end-to-end training on resource-limited low-power
edge-level FPGAs. On-device training is challenging to implement on
resource-limited FPGAs because forward propagation, backward propagation, and
weight update exhibit different memory access patterns, which degrades
efficiency. Therefore, we develop a data reshaping approach with intra-tile
continuous memory allocation and weight reuse. An analytical model is
established to
automatically schedule computation and memory resources to achieve high energy
efficiency on edge FPGAs. The experimental results show that our design
achieves a throughput of 46.99 GFLOPS and an energy efficiency of 6.09
GFLOPS/W.
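To make the scheduling idea concrete, below is a minimal Python sketch of the kind of analytical model the abstract describes: exhaustively searching convolution tile sizes that fit a hypothetical edge FPGA's DSP and BRAM budgets while minimizing estimated compute cycles. The resource budgets, buffer model, and cycle model are illustrative assumptions, not the paper's actual formulas.

```python
# Simplified design-space exploration for convolution loop tiling on an
# edge FPGA. All resource numbers and cost formulas are assumptions.
from itertools import product
import math

DSP_BUDGET = 220          # DSP slices on a small edge FPGA (assumed)
BRAM_BUDGET_KB = 560      # on-chip buffer budget in KB (assumed)

def buffer_kb(tm, tn, tr, tc, k, bytes_per_word=4):
    # Double-buffered input, output, and weight tiles.
    inp = tn * (tr + k - 1) * (tc + k - 1)
    out = tm * tr * tc
    wgt = tm * tn * k * k
    return 2 * (inp + out + wgt) * bytes_per_word / 1024

def compute_cycles(layer, tm, tn):
    # One MAC per DSP per cycle with Tm x Tn channel-level parallelism.
    m, n, r, c, k = layer
    return math.ceil(m / tm) * math.ceil(n / tn) * r * c * k * k

def best_tiles(layer):
    m, n, r, c, k = layer
    best = None
    for tm, tn in product(range(1, 65), repeat=2):
        if tm * tn > DSP_BUDGET:
            continue
        for tr, tc in ((7, 7), (14, 14), (28, 28)):
            if buffer_kb(tm, tn, tr, tc, k) > BRAM_BUDGET_KB:
                continue
            # Fewer cycles first; tie-break with larger spatial tiles,
            # which reduce off-chip traffic.
            score = (compute_cycles(layer, tm, tn), -(tr * tc))
            if best is None or score < best[0]:
                best = (score, (tm, tn, tr, tc))
    return best[1] if best else None

# Example: a 3x3 conv layer, 128 in/out channels, 28x28 feature maps.
print(best_tiles((128, 128, 28, 28, 3)))
```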
Related papers
- Toward Efficient Convolutional Neural Networks With Structured Ternary Patterns [1.1965844936801797]
Convolutional neural networks (ConvNets) exert severe demands on local device resources.
This brief presents work toward utilizing static convolutional filters to design efficient ConvNet architectures.
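As a rough illustration of the idea (not the paper's method), the sketch below quantizes a float convolution filter to a ternary pattern {-1, 0, +1} with a magnitude threshold; static ternary filters of this kind cut multiplication and storage costs.

```python
# Illustrative ternarization of a conv filter; parameters are assumptions.
import numpy as np

def ternarize(w, sparsity=0.5):
    # Zero the smallest-magnitude weights, keep the signs of the rest,
    # and scale by the mean magnitude of the surviving weights.
    thresh = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > thresh
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3))       # one 3x3 filter
print(ternarize(w))
```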
arXiv Detail & Related papers (2024-07-20T10:18:42Z)
- Energy-Efficient Federated Edge Learning with Streaming Data: A Lyapunov Optimization Approach [34.00679567444125]
We develop a dynamic scheduling and resource allocation algorithm to address the inherent randomness in data arrivals and resource availability under long-term energy constraints.
Our proposed algorithm makes adaptive decisions on device scheduling, computational capacity adjustment, and allocation of bandwidth and transmit power in every round.
The effectiveness of our scheme is verified through simulation results, demonstrating improved learning performance and energy efficiency as compared to baseline schemes.
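A minimal drift-plus-penalty sketch of Lyapunov-style scheduling under a long-term energy constraint is shown below; the per-device utility/energy values and the tradeoff weight V are illustrative assumptions, not the paper's model.

```python
# Drift-plus-penalty device scheduling sketch; all numbers are assumed.
import random

V = 10.0         # tradeoff between learning utility and queue stability
E_BUDGET = 1.0   # average per-round energy budget (assumed)
Q = 0.0          # virtual queue tracking accumulated energy debt

def schedule(devices):
    # Keep devices whose weighted score V*utility - Q*energy is positive:
    # as Q grows (budget overrun), energy-hungry devices get dropped.
    return [d for d in devices if V * d["utility"] - Q * d["energy"] > 0]

for t in range(5):
    devices = [{"utility": random.random(), "energy": random.random()}
               for _ in range(8)]
    chosen = schedule(devices)
    spent = sum(d["energy"] for d in chosen)
    Q = max(Q + spent - E_BUDGET, 0.0)   # virtual queue update
    print(f"round {t}: scheduled {len(chosen)} devices, Q={Q:.2f}")
```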
arXiv Detail & Related papers (2024-05-20T14:13:22Z)
- Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z)
- Efficient Language Model Architectures for Differentially Private Federated Learning [21.280600854272716]
Cross-device federated learning (FL) is a technique that trains a model on data distributed, typically, across millions of edge devices without the data ever leaving those devices.
In centralized training of language models, adaptive optimizers are preferred as they offer improved stability and performance.
We propose a scale-invariant Coupled Input Forget Gate (SI CIFG) recurrent network by modifying the sigmoid and tanh activations in the recurrent cell.
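For reference, a standard Coupled Input Forget Gate (CIFG) step is sketched below: the input gate is tied to 1 - forget gate, halving the gate parameters of an LSTM. The paper's scale-invariant variant replaces the sigmoid and tanh with modified activations, which are not reproduced here.

```python
# Standard CIFG recurrent step; sizes and weights are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cifg_step(x, h, c, W, U, b):
    # W, U, b pack the forget, cell-candidate, and output transforms.
    z = W @ x + U @ h + b
    f_, g_, o_ = np.split(z, 3)
    f = sigmoid(f_)             # forget gate (modified sigmoid in SI CIFG)
    g = np.tanh(g_)             # candidate state (modified tanh in SI CIFG)
    o = sigmoid(o_)             # output gate
    c = f * c + (1.0 - f) * g   # coupled input gate: i = 1 - f
    h = o * np.tanh(c)
    return h, c

d, n = 4, 8                     # input and hidden sizes (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(3 * n, d))
U = rng.normal(size=(3 * n, n))
b = np.zeros(3 * n)
h, c = np.zeros(n), np.zeros(n)
h, c = cifg_step(rng.normal(size=d), h, c, W, U, b)
print(h.shape)                  # (8,)
```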
arXiv Detail & Related papers (2024-03-12T22:21:48Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed, which lets edge devices participate in training asynchronously by actively applying for tasks.
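The sketch below illustrates the two compression steps named in the title, top-k sparsification followed by uniform quantization of a model update; it is a generic example, not the paper's exact scheme.

```python
# Generic top-k sparsification + uniform quantization of an update.
import numpy as np

def compress(update, k_ratio=0.01, bits=8):
    flat = update.ravel()
    k = max(1, int(k_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]    # top-k by magnitude
    vals = flat[idx]
    scale = np.abs(vals).max() / (2 ** (bits - 1) - 1) or 1.0
    q = np.round(vals / scale).astype(np.int8)      # uniform quantization
    return idx, q, scale, update.shape

def decompress(idx, q, scale, shape):
    out = np.zeros(np.prod(shape))
    out[idx] = q.astype(np.float32) * scale
    return out.reshape(shape)

rng = np.random.default_rng(0)
g = rng.normal(size=(256, 128))
# Roughly 1% of the entries survive compression.
print(np.count_nonzero(decompress(*compress(g))))
```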
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that maximizes data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
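Below is a generic bottleneck adapter of the kind adapter-ALBERT attaches to a frozen backbone so that only a small per-task module is trained; the dimensions are illustrative.

```python
# Generic bottleneck adapter module; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, hidden)     # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection preserves the frozen backbone's behavior.
        return x + self.up(self.act(self.down(x)))

x = torch.randn(2, 16, 768)      # (batch, seq, hidden)
print(Adapter()(x).shape)        # torch.Size([2, 16, 768])
```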
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present Online Convolutional Re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
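The textbook identity behind structural re-parameterization is folding a BatchNorm into the preceding convolution so a training-time block collapses into one conv; the sketch below shows that fusion for intuition, while OREPA's online squeezing during training is more involved.

```python
# Fold a BatchNorm into the preceding conv (deploy-time fusion).
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()                         # use running statistics
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))
```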
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices [21.513786638743234]
FTPipeHD is a novel framework that trains deep learning models across heterogeneous devices.
It is shown that FTPipeHD trains 6.8x faster than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst one.
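As a rough sketch of heterogeneity-aware partitioning (not FTPipeHD's actual algorithm), the code below assigns each device a contiguous slice of layers proportional to its measured compute capacity, which is the intuition behind balancing a pipeline across unequal devices.

```python
# Capacity-proportional layer partitioning; costs/capacities are assumed.
def partition(layer_costs, capacities):
    total_cost, total_cap = sum(layer_costs), sum(capacities)
    splits, acc, dev = [], 0.0, 0
    budget = total_cost * capacities[dev] / total_cap
    for i, cost in enumerate(layer_costs):
        acc += cost
        if acc >= budget and dev < len(capacities) - 1:
            splits.append(i + 1)   # cut the pipeline after layer i
            dev += 1
            acc = 0.0
            budget = total_cost * capacities[dev] / total_cap
    return splits

# 10 equal-cost layers, three devices, one of them 4x faster.
print(partition([1.0] * 10, [4.0, 1.0, 1.0]))  # [7, 9]: slices 0-6, 7-8, 9
```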
arXiv Detail & Related papers (2021-10-06T14:00:22Z)
- perf4sight: A toolflow to model CNN training performance on Edge GPUs [16.61258138725983]
This work proposes perf4sight, an automated methodology for developing accurate models that predict CNN training memory footprint and latency.
With PyTorch as the framework and NVIDIA Jetson TX2 as the target device, the developed models predict training memory footprint and latency with 95% and 91% accuracy respectively.
arXiv Detail & Related papers (2021-08-12T07:55:37Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
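The sketch below fake-quantizes a tensor at a bit width that grows over training, the basic idea of progressive fractional quantization; the step schedule is an illustrative assumption, not FracTrain's actual controller.

```python
# Progressive-precision fake quantization; the schedule is assumed.
import numpy as np

def fake_quantize(x, bits):
    # Symmetric uniform quantizer over the tensor's dynamic range.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) or 1.0
    return np.round(x / scale) * scale

def bits_at(epoch, total=90, schedule=(3, 4, 6, 8)):
    # Step up precision at fixed fractions of training (assumed policy).
    stage = min(int(len(schedule) * epoch / total), len(schedule) - 1)
    return schedule[stage]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for epoch in (0, 30, 60, 89):
    q = fake_quantize(w, bits_at(epoch))
    print(epoch, bits_at(epoch), float(np.abs(w - q).mean()))
```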
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Fast-Convergent Federated Learning [82.32029953209542]
Federated learning is a promising solution for distributing machine learning tasks through modern networks of mobile devices.
We propose a fast-convergent federated learning algorithm, called FOLB, which performs intelligent sampling of devices in each round of model training.
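As a simple stand-in for FOLB's intelligent sampling (the paper's estimator differs), the sketch below selects devices with probability proportional to the norm of their last update.

```python
# Weighted device sampling by update norm; the norms are illustrative.
import numpy as np

def sample_devices(update_norms, k, rng):
    p = np.asarray(update_norms, dtype=float)
    p = p / p.sum()                       # sampling probabilities
    return rng.choice(len(p), size=k, replace=False, p=p)

rng = np.random.default_rng(0)
norms = [0.1, 2.0, 0.5, 1.5, 0.2, 3.0, 0.05, 0.8]
print(sample_devices(norms, k=3, rng=rng))
```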
arXiv Detail & Related papers (2020-07-26T14:37:51Z)