Accelerated Training on Low-Power Edge Devices
- URL: http://arxiv.org/abs/2502.18323v1
- Date: Tue, 25 Feb 2025 16:18:15 GMT
- Title: Accelerated Training on Low-Power Edge Devices
- Authors: Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Heba Khdr, Osama Abboud, Ramin Khalili, Jörg Henkel
- Abstract summary: Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. We propose to jointly adjust the system and application parameters while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization.
- Score: 11.02161053136761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that depend on state-of-the-art techniques, reducing the training time by $2.4\times$ with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process. These gains are achieved without reduction in the performance of the trained model.
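To make the cross-layer idea concrete, below is a minimal sketch (not the authors' implementation) of picking a GPU frequency and batch size under a power cap. The profiling tables, the batch-size efficiency values, and the power cap are made-up placeholders; the paper's actual predictor and device profiler are not reproduced here.

```python
# Hypothetical illustration of the cross-layer idea from the abstract:
# pick the (GPU frequency, batch size) pair that maximizes efficiency-
# weighted training throughput while the profiled power stays under the cap.
# All numbers below are invented for demonstration.

# Profiled power draw in watts for (frequency_mhz, batch_size) pairs.
power_profile = {
    (420, 16): 4.1, (420, 32): 4.6, (420, 64): 5.0,
    (720, 16): 5.8, (720, 32): 6.7, (720, 64): 7.5,
    (1100, 16): 7.9, (1100, 32): 9.2, (1100, 64): 10.8,
}

# Profiled throughput in samples/second for the same pairs.
throughput_profile = {
    (420, 16): 55.0, (420, 32): 70.0, (420, 64): 78.0,
    (720, 16): 90.0, (720, 32): 118.0, (720, 64): 132.0,
    (1100, 16): 120.0, (1100, 32): 160.0, (1100, 64): 185.0,
}

# Predicted statistical efficiency of each batch size: larger batches may
# need more updates to reach the same accuracy, discounting raw throughput.
batch_efficiency = {16: 1.00, 32: 0.97, 64: 0.90}

POWER_CAP_W = 7.0  # device-level power constraint (placeholder)


def best_configuration(power_cap):
    """Return the (frequency, batch_size) pair with the highest
    efficiency-weighted throughput that respects the power cap."""
    best, best_score = None, float("-inf")
    for config, watts in power_profile.items():
        if watts > power_cap:
            continue  # violates the power constraint
        _, batch = config
        score = throughput_profile[config] * batch_efficiency[batch]
        if score > best_score:
            best, best_score = config, score
    return best, best_score


if __name__ == "__main__":
    config, score = best_configuration(POWER_CAP_W)
    print(f"selected (frequency, batch size) = {config}, "
          f"effective throughput ~ {score:.1f} samples/s")
```

A real deployment would fill the power and throughput tables from on-device profiling and obtain the batch-size efficiency factors from a learned predictor; exhaustive search is only viable here because the configuration space is tiny.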
Related papers
- QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth, which adopts post-training quantization to quantize monocular depth estimation (MDE) models with hardware acceleration for ASICs.
Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost.
We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
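As a rough illustration of what quantizing weights (or activations) to 4-bit precision involves, here is a generic per-tensor symmetric quantization sketch; it is not QuartDepth's calibration procedure, and the tensor values are invented.

```python
# Generic 4-bit symmetric post-training quantization of a weight tensor.
import numpy as np

def quantize_int4(x):
    """Quantize a float tensor to the signed 4-bit range [-8, 7] with one
    per-tensor scale; return the integer codes and the scale."""
    qmax = 7
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map 4-bit integer codes back to floating point."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.05, size=(4, 8)).astype(np.float32)
    q, scale = quantize_int4(weights)
    max_err = float(np.abs(weights - dequantize(q, scale)).max())
    print(f"scale={scale:.5f}, max abs quantization error={max_err:.5f}")
```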
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
- HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration [5.88033624474104]
HALO is a versatile framework for Hardware-Aware Post-Training Quantization (PTQ).
Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption.
HALO achieves average performance improvements of 270% and energy savings of 51% over baseline quantization methods.
arXiv Detail & Related papers (2025-02-27T01:08:33Z)
- Taming 3DGS: High-Quality Radiance Fields with Limited Resources [50.92437599516609]
3D Gaussian Splatting (3DGS) has transformed novel-view synthesis with its fast, interpretable, and high-fidelity rendering.
We tackle the challenges of training and rendering 3DGS models on a budget.
We derive faster, numerically equivalent solutions for gradient computation and attribute updates.
arXiv Detail & Related papers (2024-06-21T20:44:23Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting growing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed, which lets edge devices asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- SCoTTi: Save Computation at Training Time with an adaptive framework [7.780766187171572]
On-device training is an emerging approach in machine learning where models are trained on edge devices.
We propose SCoTTi (Save Computation at Training Time), an adaptive framework that addresses the challenge of reducing resource consumption during training.
The proposed approach outperforms state-of-the-art methods in computational resource savings on several commonly used benchmarks.
arXiv Detail & Related papers (2023-12-19T16:19:33Z)
- Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices [3.4530027457862]
Federated learning (FL) is usually performed on resource-constrained edge devices.
The FL training process should be adjusted to these constraints.
We propose a new method that enables successive freezing and training of the parameters of the FL model on devices.
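The sketch below illustrates the general idea of successively freezing and training parameter blocks on a device, using a toy PyTorch model; the stage schedule, model, and optimizer settings are illustrative assumptions, not the paper's protocol.

```python
# Hedged sketch of successive layer training on a constrained device:
# earlier layers stay frozen, so only one block of parameters is updated
# (and would be communicated) per stage.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
stages = [model[0], model[2], model[4]]  # train one Linear layer per stage

def train_stage(stage_idx, data, targets, epochs=1):
    """Freeze everything, unfreeze only the current stage's layer, train."""
    for p in model.parameters():
        p.requires_grad = False
    for p in stages[stage_idx].parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(data), targets)
        loss.backward()
        opt.step()
    return loss.item()

if __name__ == "__main__":
    x = torch.randn(16, 32)
    y = torch.randint(0, 10, (16,))
    for stage in range(len(stages)):
        print(f"stage {stage}: loss={train_stage(stage, x, y):.3f}")
```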
arXiv Detail & Related papers (2023-05-26T15:04:06Z)
- TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference [6.0093441900032465]
We propose a framework that simulates transformer inference and training on a design space of accelerators.
We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models.
The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair.
arXiv Detail & Related papers (2023-03-27T02:45:18Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
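One common way to "naively adjust" the learning rate as the mini-batch grows is the linear scaling rule; the short sketch below shows that heuristic (and a more conservative square-root variant) with placeholder base values, since the paper's exact adjustment is not reproduced here.

```python
# Learning-rate adjustment heuristics for large-batch training.
# The base batch size and base learning rate are placeholders.
base_batch_size = 256
base_lr = 3e-4

def scaled_lr(batch_size, rule="linear"):
    """Scale the learning rate with the batch-size growth factor k."""
    k = batch_size / base_batch_size
    if rule == "linear":   # lr grows proportionally with the batch
        return base_lr * k
    if rule == "sqrt":     # more conservative alternative
        return base_lr * k ** 0.5
    raise ValueError(rule)

if __name__ == "__main__":
    for bs in (256, 1024, 4096):
        print(bs,
              f"linear lr={scaled_lr(bs):.2e}",
              f"sqrt lr={scaled_lr(bs, 'sqrt'):.2e}")
```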
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- ZeroFL: Efficient On-Device Training for Federated Learning with Local Sparsity [15.908499928588297]
In Federated Learning (FL), nodes are orders of magnitude more constrained than traditional server-grade hardware.
We propose ZeroFL, a framework that relies on highly sparse operations to accelerate on-device training.
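As a generic illustration of the highly sparse operations such a framework relies on, the sketch below keeps only the largest-magnitude entries of a gradient tensor and zeroes the rest; it is not ZeroFL's actual masking scheme, and the keep ratio is a placeholder.

```python
# Generic top-k magnitude sparsification of a tensor.
import numpy as np

def topk_sparsify(x, keep_ratio=0.1):
    """Zero all but the top `keep_ratio` fraction of entries by magnitude."""
    flat = np.abs(x).ravel()
    k = max(1, int(keep_ratio * flat.size))
    threshold = np.partition(flat, -k)[-k]   # k-th largest magnitude
    mask = np.abs(x) >= threshold
    return x * mask, mask

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    grad = rng.normal(size=(8, 8))
    sparse_grad, mask = topk_sparsify(grad, keep_ratio=0.1)
    print(f"kept {int(mask.sum())} of {mask.size} entries")
```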
arXiv Detail & Related papers (2022-08-04T07:37:07Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- Ps and Qs: Quantization-aware pruning for efficient low latency neural network inference [56.24109486973292]
We study the interplay between pruning and quantization during the training of neural networks for ultra low latency applications.
We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task.
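To illustrate the interplay between the two techniques, the minimal sketch below applies a magnitude-based pruning mask and a uniform fake-quantization step to the same weight tensor; the sparsity level, bit width, and helper functions are illustrative assumptions rather than the paper's method.

```python
# Combine magnitude pruning with fake quantization of the surviving weights.
import numpy as np

def prune_mask(w, sparsity=0.5):
    """Keep the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w).ravel())[k] if k > 0 else 0.0
    return (np.abs(w) >= threshold).astype(w.dtype)

def fake_quantize(w, bits=4):
    """Round weights to a uniform per-tensor grid of 2**bits levels."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    w = rng.normal(scale=0.1, size=(6, 6))
    w_eff = fake_quantize(w * prune_mask(w, sparsity=0.5), bits=4)
    print(f"nonzero weights after pruning+quantization: {np.count_nonzero(w_eff)}")
```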
arXiv Detail & Related papers (2021-02-22T19:00:05Z)
- Improving the Speed and Quality of GAN by Adversarial Training [87.70013107142142]
We develop FastGAN to improve the speed and quality of GAN training based on the adversarial training technique.
Our training algorithm brings ImageNet training to the broader public by requiring 2-4 GPUs.
arXiv Detail & Related papers (2020-08-07T20:21:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.