Accelerating Transfer Learning with Near-Data Computation on Cloud
Object Stores
- URL: http://arxiv.org/abs/2210.08650v1
- Date: Sun, 16 Oct 2022 22:28:36 GMT
- Title: Accelerating Transfer Learning with Near-Data Computation on Cloud
Object Stores
- Authors: Arsany Guirguis, Diana Petrescu, Florin Dinu, Do Le Quoc, Javier
Picorel, Rachid Guerraoui
- Abstract summary: This paper identifies transfer learning (TL) as a natural fit for the disaggregated cloud.
We show how to leverage the unique structure of TL's fine-tuning phase to flexibly address the aforementioned constraints.
We present HAPI, a processing system for TL that spans the compute and storage tiers while remaining transparent to the user.
- Score: 5.057544107331778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Storage disaggregation is fundamental to today's cloud due to cost and
scalability benefits. Unfortunately, this design must cope with an inherent
network bottleneck between the storage and the compute tiers. The widely
deployed mitigation strategy is to provide computational resources next to
storage to push down a part of an application and thus reduce the amount of
data transferred to the compute tier. Overall, users of disaggregated storage
need to consider two main constraints: the network may remain a bottleneck, and
the storage-side computational resources are limited. This paper identifies
transfer learning (TL) as a natural fit for the disaggregated cloud. TL,
famously described as the next driver of ML commercial success, is widely
used and has a broad range of applications. We show how to leverage the unique
structure of TL's fine-tuning phase (i.e., a combination of feature extraction
and training) to flexibly address the aforementioned constraints and improve
both user and operator-centric metrics. The key to improving user-perceived
performance is to mitigate the network bottleneck by carefully splitting the TL
deep neural network (DNN) such that feature extraction is, partially or
entirely, executed next to storage. Crucially, such splitting enables
decoupling the batch size of feature extraction from the training batch size,
facilitating efficient storage-side batch size adaptation to increase
concurrency in the storage tier while avoiding out-of-memory errors. Guided by
these insights, we present HAPI, a processing system for TL that spans the
compute and storage tiers while remaining transparent to the user. Our
evaluation with several DNNs, such as ResNet, VGG, and Transformer, shows up to
11x improvement in application runtime and up to 8.3x reduction in the data
transferred from the storage to the compute tier compared to running the
computation in the compute tier.
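As a rough illustration of this split execution (a PyTorch sketch, not HAPI's actual interface; the split index and batch sizes are arbitrary):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)
layers = list(model.children())
split_at = 6                                     # hypothetical split point in the DNN
storage_side = nn.Sequential(*layers[:split_at]).eval()           # frozen feature extraction
tail = layers[split_at:]
compute_side = nn.Sequential(*tail[:-1], nn.Flatten(), tail[-1])  # trained on the compute tier

extract_bs, train_bs = 8, 32                     # decoupled batch sizes
images = torch.randn(train_bs, 3, 224, 224)

# Storage side: small batches keep per-request memory low, so the storage
# tier can raise concurrency without out-of-memory errors.
with torch.no_grad():
    feats = torch.cat([storage_side(chunk) for chunk in images.split(extract_bs)])

# Compute side: the training step consumes the full training batch of features.
labels = torch.randint(0, 10, (train_bs,))
loss = nn.functional.cross_entropy(compute_side(feats), labels)
loss.backward()
```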
Related papers
- How to Train Your LLM Web Agent: A Statistical Diagnosis [102.04125085041473]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training.
Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) followed by on-policy reinforcement learning.
Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
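A toy sketch of this two-stage recipe (tiny tensors stand in for the LLMs and web environments, and plain REINFORCE stands in for the paper's RL method):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))  # toy "student"
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stage 1: SFT -- imitate the teacher's actions (random stand-ins here).
states = torch.randn(256, 16)
teacher_actions = torch.randint(0, 4, (256,))    # would come from the 70B teacher
for _ in range(100):
    loss = nn.functional.cross_entropy(policy(states), teacher_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: on-policy RL -- sample actions from the student itself and
# reinforce them in proportion to task reward.
for _ in range(100):
    s = torch.randn(64, 16)
    dist = torch.distributions.Categorical(logits=policy(s))
    a = dist.sample()
    reward = (a == 0).float()                    # stand-in task reward
    loss = -(dist.log_prob(a) * (reward - reward.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```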
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers alleviates the peak memory overhead.
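A loose sketch of the two-route idea (the stand-in backbone, fusion weights, and simplified "anti-redundancy" operation are illustrative, not SHERL's actual design):

```python
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])  # stand-in pre-trained backbone
for p in blocks.parameters():
    p.requires_grad_(False)

fuse_weights = nn.Parameter(torch.zeros(5))      # tiny trainable fusion over intermediate outputs
head = nn.Linear(32, 10)

x = torch.randn(8, 32)
inters = []
with torch.no_grad():                            # early route: no activations kept for backprop
    h = x
    for blk in blocks[:-1]:
        h = torch.relu(blk(h))
        inters.append(h)

# Consolidate the intermediate outputs, then run only the last pre-trained
# block (late route) under autograd, keeping peak memory low.
fused = (torch.softmax(fuse_weights, 0)[:, None, None] * torch.stack(inters)).sum(0)
out = head(torch.relu(blocks[-1](fused)))
loss = nn.functional.cross_entropy(out, torch.randint(0, 10, (8,)))
loss.backward()                                  # gradients reach only the fusion weights and head
```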
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- Breaking MLPerf Training: A Case Study on Optimizing BERT [9.486916730173661]
We present novel approaches for fast large-scale training of BERT model.
Load balancing is imperative in distributed BERT training because training samples vary widely in length.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing and (2) bucket-wise gradient clipping before allreduce.
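A simple sketch of idea (1), with illustrative lengths and sharding:

```python
import random

random.seed(0)
lengths = [random.randint(16, 512) for _ in range(4096)]   # stand-in token counts
num_workers, batch_size = 8, 32

# Stratify: sort globally by length, deal samples round-robin so every worker
# sees a similar length distribution, then batch each worker's sorted shard.
order = sorted(range(len(lengths)), key=lambda i: lengths[i])
shards = [order[w::num_workers] for w in range(num_workers)]
batches = [[shard[i:i + batch_size] for i in range(0, len(shard), batch_size)]
           for shard in shards]

# Padded work per batch is roughly batch_size * max length in the batch;
# presorting keeps batch members similar in length, so workers stay balanced.
cost = lambda batch: batch_size * max(lengths[i] for i in batch)
print([sum(map(cost, worker_batches)) for worker_batches in batches])
```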
arXiv Detail & Related papers (2024-02-04T11:12:17Z)
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting growing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed, which exploits edge devices that asynchronously participate in the training process by actively applying for tasks.
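A minimal sketch of compressing an asynchronous client update with top-k sparsification plus 8-bit quantization (the fraction, staleness rule, and helper names are invented):

```python
import torch

def compress(update: torch.Tensor, k_frac: float = 0.01):
    flat = update.flatten()
    k = max(1, int(k_frac * flat.numel()))
    _, idx = flat.abs().topk(k)                  # top-k sparsification
    kept = flat[idx]
    scale = kept.abs().max().clamp(min=1e-8) / 127.0
    q = (kept / scale).round().clamp(-127, 127).to(torch.int8)   # 8-bit quantization
    return idx, q, scale, update.shape

def decompress(idx, q, scale, shape):
    flat = torch.zeros(shape).flatten()
    flat[idx] = q.float() * scale
    return flat.reshape(shape)

delta = torch.randn(256, 128)                    # stand-in client model update
server_delta = decompress(*compress(delta))
staleness = 3                                    # asynchronous arrival: down-weight stale updates
lr = 0.1 / (1 + staleness)
# server side: global_weights += lr * server_delta
```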
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- tf.data service: A Case for Disaggregating ML Input Data Processing [4.851146762916078]
Machine learning (ML) computations commonly execute on expensive specialized hardware, such as GPUs and TPUs, which provide high FLOPs and performance-per-watt.
The host CPU and RAM required per accelerator core to avoid input data stalls vary across jobs.
We present tf.data service, an open-source disaggregated input data processing service built on top of tf.data in TensorFlow.
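Client-side usage of tf.data service looks roughly like this (the dispatcher address is a placeholder):

```python
import tensorflow as tf

def preprocess(x):
    return tf.cast(x, tf.float32) / 255.0        # CPU-heavy work, offloaded to remote workers

dataset = (tf.data.Dataset.range(10_000)
           .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32))

# Route the pipeline through the disaggregated service: remote workers run
# map/batch, decoupling input-processing CPU/RAM from the accelerator hosts.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service="grpc://dispatcher.example.com:5000"))
```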
arXiv Detail & Related papers (2022-10-26T16:15:45Z)
- Acceleration of Federated Learning with Alleviated Forgetting in Local Training [61.231021417674235]
Federated learning (FL) enables distributed optimization of machine learning models while protecting privacy.
We propose FedReg, an algorithm to accelerate FL with alleviated knowledge forgetting in the local training stage.
Our experiments demonstrate that FedReg significantly improves the convergence rate of FL, especially when the neural network architecture is deep.
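FedReg's exact mechanism is not spelled out here; as a stand-in, the sketch below shows the general recipe of regularizing local training toward the global model (a FedProx-style proximal term) so that local steps forget less of the global knowledge:

```python
import torch
import torch.nn as nn

global_model = nn.Linear(20, 2)                  # weights received from the server
local_model = nn.Linear(20, 2)
local_model.load_state_dict(global_model.state_dict())
opt = torch.optim.SGD(local_model.parameters(), lr=0.1)
mu = 0.01                                        # regularization strength (assumed)

x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
for _ in range(10):                              # local training steps on client data
    loss = nn.functional.cross_entropy(local_model(x), y)
    prox = sum((lp - gp.detach()).pow(2).sum()
               for lp, gp in zip(local_model.parameters(), global_model.parameters()))
    opt.zero_grad()
    (loss + 0.5 * mu * prox).backward()          # penalize drifting from the global model
    opt.step()
```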
arXiv Detail & Related papers (2022-03-05T02:31:32Z)
- HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments [37.55572042288321]
The training process of deep neural networks (DNNs) generally handles large-scale input data with many sparse features.
Paddle-HeterPS is composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method.
We show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller).
arXiv Detail & Related papers (2021-11-20T17:09:15Z)
- Improving Computational Efficiency in Visual Reinforcement Learning via Stored Embeddings [89.63764845984076]
We present Stored Embeddings for Efficient Reinforcement Learning (SEER).
SEER is a simple modification of existing off-policy deep reinforcement learning methods.
We show that SEER does not degrade the performance of RL agents while significantly saving computation and memory.
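A condensed sketch of the stored-embedding idea (toy sizes; SEER's freezing schedule and replay details differ):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(), nn.Flatten())
head = nn.Linear(32 * 20 * 20, 4)                # e.g., Q-values for 4 actions

for p in encoder.parameters():                   # freeze the encoder early in training
    p.requires_grad_(False)

replay = []                                      # buffer stores small embeddings, not raw frames
frame = torch.randn(1, 3, 84, 84)
with torch.no_grad():
    replay.append(encoder(frame))                # compute each embedding once, reuse thereafter

batch = torch.cat(replay)
q_values = head(batch)                           # later training touches only the head
```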
arXiv Detail & Related papers (2021-03-04T08:14:10Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients.
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
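A bare-bones sketch of a progressive-precision schedule with fake quantization (the bit-widths and schedule are invented):

```python
import torch

def fake_quant(t: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / qmax
    return (t / scale).round().clamp(-qmax, qmax) * scale   # quantize-dequantize

schedule = {0: 3, 30: 4, 60: 6, 90: 8}           # epoch -> bit-width (illustrative)

bits = 3
for epoch in range(100):
    bits = schedule.get(epoch, bits)             # precision rises as training progresses
    w = torch.randn(128, 128)                    # stand-in weights
    w_q = fake_quant(w, bits)                    # low-precision values used in the forward pass
```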
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Weight Update Skipping: Reducing Training Time for Artificial Neural Networks [0.30458514384586394]
We propose a new training methodology for ANNs that exploits the observation that improvements in accuracy show temporal variations, which allows weight updates to be skipped during time windows of little change.
During such time windows, we keep updating the bias, which ensures the network still trains and avoids overfitting.
Such a training approach achieves virtually the same accuracy with considerably less computational cost, and thus lower training time.
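A schematic sketch of the skipping rule (the plateau threshold and window length are invented):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))

prev_acc, skip_window = 0.0, 0
for epoch in range(50):
    out = model(x)
    loss = nn.functional.cross_entropy(out, y)
    opt.zero_grad()
    loss.backward()
    if skip_window > 0:                          # inside a skip window:
        for name, p in model.named_parameters():
            if not name.endswith("bias"):
                p.grad = None                    # drop weight gradients, keep bias updates
        skip_window -= 1
    opt.step()
    acc = (out.argmax(1) == y).float().mean().item()
    if acc - prev_acc < 1e-3:                    # accuracy plateaued -> start skipping
        skip_window = 5
    prev_acc = acc
```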
arXiv Detail & Related papers (2020-12-05T15:12:10Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)