Exploring Scaling Laws for Local SGD in Large Language Model Training
- URL: http://arxiv.org/abs/2409.13198v1
- Date: Fri, 20 Sep 2024 04:02:48 GMT
- Title: Exploring Scaling Laws for Local SGD in Large Language Model Training
- Authors: Qiaozhi He, Xiaomin Zhuang, Zhihua Wu
- Abstract summary: We show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources.
This demonstrates its viability as an alternative to single large-cluster training.
- Score: 4.125418728284004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates scaling laws for local SGD, a distributed optimization algorithm that facilitates training on loosely connected devices, in LLM training. Through extensive experiments, we show that local SGD achieves competitive results compared to conventional methods, given equivalent model parameters, datasets, and computational resources. Furthermore, we explore the application of local SGD in various practical scenarios, including multi-cluster setups and edge computing environments. Our findings elucidate the necessary conditions for effective multi-cluster LLM training and examine the potential and limitations of leveraging edge computing resources in the LLM training process. Together, these results demonstrate the viability of local SGD as an alternative to single large-cluster training.
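To make the setting concrete, here is a minimal sketch of local SGD with periodic parameter averaging on a toy quadratic objective. The worker count, synchronization interval, learning rate, and objective are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal sketch of local SGD with periodic averaging (illustrative only).
# Assumptions (not from the paper): 4 workers, sync every H = 8 local steps,
# each worker minimizes a simple quadratic on its own data shard.

rng = np.random.default_rng(0)
num_workers, dim, H, lr, steps = 4, 16, 8, 0.05, 200

# Each worker's shard defines a quadratic loss 0.5 * ||x - target_m||^2.
targets = rng.normal(size=(num_workers, dim))

# All workers start from the same initial parameters.
params = np.zeros((num_workers, dim))

for t in range(1, steps + 1):
    # Local step: each worker updates on its own gradient only.
    grads = params - targets          # gradient of 0.5 * ||x - target_m||^2
    params -= lr * grads
    # Communication step: average parameters across workers every H local steps.
    if t % H == 0:
        params[:] = params.mean(axis=0)

# The averaged model approximates the minimizer of the global objective,
# which for this toy problem is the mean of the per-worker targets.
global_model = params.mean(axis=0)
print(np.linalg.norm(global_model - targets.mean(axis=0)))
```

Each worker takes H independent SGD steps on its own shard before parameters are averaged, which is what makes the method attractive for loosely connected devices: communication happens once every H steps rather than at every step.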
Related papers
- SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources.
Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach [18.153641696306707]
This study introduces a framework, inspired by model-based reinforcement learning (MBRL), to determine the optimal splitting point across the edge and user equipment (UE).
By incorporating a reward surrogate model, our approach significantly reduces the computational cost of frequent performance evaluations.
arXiv Detail & Related papers (2024-06-03T09:41:42Z)
- LLS: Local Learning Rule for Deep Neural Networks Inspired by Neural Activity Synchronization [6.738409533239947]
Training deep neural networks (DNNs) using traditional backpropagation (BP) presents challenges in terms of computational complexity and energy consumption.
We propose a novel Local Learning rule inspired by neural activity Synchronization phenomena (LLS) observed in the brain.
LLS achieves comparable performance with up to $300\times$ fewer multiply-accumulate (MAC) operations and half the memory requirements of BP.
arXiv Detail & Related papers (2024-05-24T18:24:24Z)
- Checkpoint Merging via Bayesian Optimization in LLM Pretraining [10.743581503931523]
We propose checkpoint merging in the pretraining of large language models (LLMs).
The proposed methodology can augment pretraining, offering substantial benefits at minimal cost.
arXiv Detail & Related papers (2024-03-28T13:01:18Z)
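As a rough illustration of weight-space checkpoint merging (the general idea behind the entry above), the sketch below combines two checkpoints with a single mixing coefficient. The checkpoint format, parameter names, and fixed coefficient are assumptions made for the example; the paper instead uses Bayesian optimization to decide how checkpoints are combined, which is not shown here.

```python
import numpy as np

def merge_checkpoints(ckpt_a, ckpt_b, alpha=0.5):
    """Convex combination of two checkpoints stored as name -> array dicts.

    Only a sketch of weight-space merging; the cited paper searches for the
    merging weights with Bayesian optimization rather than fixing alpha.
    """
    assert ckpt_a.keys() == ckpt_b.keys(), "checkpoints must share parameter names"
    return {name: alpha * ckpt_a[name] + (1.0 - alpha) * ckpt_b[name]
            for name in ckpt_a}

# Toy usage with two hypothetical checkpoints of the same tiny model.
rng = np.random.default_rng(0)
ckpt_1 = {"embed.weight": rng.normal(size=(8, 4)), "lm_head.bias": rng.normal(size=8)}
ckpt_2 = {"embed.weight": rng.normal(size=(8, 4)), "lm_head.bias": rng.normal(size=8)}
merged = merge_checkpoints(ckpt_1, ckpt_2, alpha=0.3)
print(merged["embed.weight"].shape)  # (8, 4)
```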
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training [3.0051215935332505]
This paper presents vTrain, a profiling-driven simulator for determining efficient and cost-effective training system configurations.
We demonstrate vTrain's practicality through several case studies, e.g., effectively evaluating optimal training parallelization strategies.
arXiv Detail & Related papers (2023-11-27T13:35:15Z)
- FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs, and then introduces our package FS-LLM as its main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between the self-supervised learning (SSL) and dynamic computation (DC) paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- Local SGD: Unified Theory and New Efficient Methods [8.701566919381223]
We present a unified framework for analyzing local SGD methods in the convex and strongly convex regimes.
We develop the first linearly converging local SGD method which does not require any data homogeneity or other strong assumptions.
arXiv Detail & Related papers (2020-11-03T13:02:50Z)
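For reference, a common formulation of the local SGD update analyzed in this line of work is sketched below; the notation ($M$ workers, synchronization period $H$, step size $\gamma$, stochastic gradients $g^m_t$) follows standard convention and is not necessarily the notation used in the cited papers.

```latex
\[
x^{m}_{t+1} =
\begin{cases}
x^{m}_{t} - \gamma\, g^{m}_{t}, & \text{if } (t+1) \bmod H \neq 0,\\[4pt]
\dfrac{1}{M}\sum_{j=1}^{M}\bigl(x^{j}_{t} - \gamma\, g^{j}_{t}\bigr), & \text{if } (t+1) \bmod H = 0,
\end{cases}
\qquad
\mathbb{E}\bigl[g^{m}_{t}\bigr] = \nabla f_{m}\bigl(x^{m}_{t}\bigr).
\]
```

Each worker $m$ runs stochastic gradient steps on its own objective $f_m$, and all workers average their iterates every $H$ steps; setting $H = 1$ recovers fully synchronous mini-batch SGD.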
- Dif-MAML: Decentralized Multi-Agent Meta-Learning [54.39661018886268]
We propose a cooperative multi-agent meta-learning algorithm, referred to as Dif-MAML.
We show that the proposed strategy allows a collection of agents to attain agreement at a linear rate and to converge to a stationary point of the aggregate MAML objective.
Simulation results illustrate the theoretical findings and the superior performance relative to the traditional non-cooperative setting.
arXiv Detail & Related papers (2020-10-06T16:51:09Z)