vTrain: A Simulation Framework for Evaluating Cost-effective and
Compute-optimal Large Language Model Training
- URL: http://arxiv.org/abs/2312.12391v1
- Date: Mon, 27 Nov 2023 13:35:15 GMT
- Authors: Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu
- Abstract summary: This paper presents our profiling-driven simulator called vTrain to determine an efficient and cost-effective training system configuration.
We demonstrate vTrain's practicality through several case studies, e.g., effectively evaluating optimal training parallelization strategies.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As large language models (LLMs) become widespread in various application
domains, a critical challenge facing the AI community is how to train these
large models in a cost-effective manner. Existing LLM training plans
typically employ a heuristic-based parallel training strategy grounded in
empirical observations rather than a thorough examination of the search
space of LLM parallelization. This limitation leaves significant performance
on the table, wasting millions of dollars in training cost. This paper
presents our profiling-driven simulator, vTrain, which provides AI
practitioners with a fast yet accurate software framework for determining an
efficient and cost-effective LLM training system configuration. We
demonstrate vTrain's practicality through several case studies, e.g.,
evaluating optimal training parallelization strategies that balance
training time against its associated training cost,
efficient multi-tenant GPU cluster schedulers targeting multiple LLM training
jobs, and determining a compute-optimal LLM model architecture given a fixed
compute budget.
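To make the cost-time trade-off concrete, here is a minimal, hypothetical sketch of the kind of back-of-the-envelope estimate a training-cost simulator refines. It uses the common C ≈ 6ND approximation for total training FLOPs; all hardware numbers, the utilization figure, and the pricing are illustrative assumptions, not vTrain's actual model.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs via the common C ~= 6*N*D rule."""
    return 6.0 * n_params * n_tokens

def estimate_days_and_cost(n_params, n_tokens, n_gpus,
                           gpu_flops, mfu, usd_per_gpu_hour):
    """Estimate wall-clock days and dollar cost for one training run."""
    flops = training_flops(n_params, n_tokens)
    # mfu = model FLOPs utilization: fraction of peak actually achieved
    seconds = flops / (n_gpus * gpu_flops * mfu)
    hours = seconds / 3600.0
    return hours / 24.0, hours * n_gpus * usd_per_gpu_hour

# Sweep GPU counts for a hypothetical 13B-parameter model on 260B tokens.
for n_gpus in (256, 512, 1024):
    days, cost = estimate_days_and_cost(
        n_params=13e9, n_tokens=260e9, n_gpus=n_gpus,
        gpu_flops=312e12,      # A100 BF16 peak, assumed
        mfu=0.4,               # assumed utilization after parallelization overheads
        usd_per_gpu_hour=2.0,  # assumed cloud price
    )
    print(f"{n_gpus} GPUs: ~{days:.1f} days, ~${cost:,.0f}")
```

Note that in this naive model the dollar cost is independent of GPU count (halving the time doubles the GPU-hours billed per hour), while in practice utilization degrades with scale depending on the parallelization strategy; capturing exactly that dependence is what a profiling-driven simulator is for.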
Related papers
- SoupLM: Model Integration in Large Language and Multi-Modal Models [51.12227693121004]
Training large language models (LLMs) requires significant computing resources.
Existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks.
arXiv Detail & Related papers (2024-07-11T05:38:15Z)
- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments [43.107261545706415]
Large Language Models (LLMs) have advanced rapidly but face significant memory demands.
Current methods typically require lengthy training to alleviate the performance degradation from quantization loss.
We make an initial attempt to extend the once-for-all framework to large language models.
arXiv Detail & Related papers (2024-05-30T16:05:15Z)
- From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents.
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- The Future of Large Language Model Pre-training is Federated [15.237418036900582]
Federated learning has the potential to unleash the majority of the planet's data and computational resources.
We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm.
We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources.
arXiv Detail & Related papers (2024-05-17T15:27:52Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Efficient Device Scheduling with Multi-Job Federated Learning [64.21733164243781]
We propose a novel multi-job Federated Learning framework to enable the parallel training process of multiple jobs.
We propose a reinforcement learning-based method and a Bayesian optimization-based method to schedule devices for multiple jobs while minimizing the cost.
Our proposed approaches significantly outperform baseline approaches in terms of training time (up to 8.67 times faster) and accuracy (up to 44.6% higher).
arXiv Detail & Related papers (2021-12-11T08:05:11Z)
- MLComp: A Methodology for Machine Learning-based Performance Estimation and Adaptive Selection of Pareto-Optimal Compiler Optimization Sequences [10.200899224740871]
We propose a novel Reinforcement Learning-based policy methodology for embedded software optimization.
We show that different Machine Learning models are automatically tested to choose the best-fitting one.
We also show that our framework can be trained efficiently for any target platform and application domain.
arXiv Detail & Related papers (2020-12-09T19:13:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.