Related papers: RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

URL: http://arxiv.org/abs/2310.16355v3
Date: Wed, 12 Jun 2024 22:07:54 GMT
Title: RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs
Authors: Bowen Tan, Yun Zhu, Lijuan Liu, Hongyi Wang, Yonghao Zhuang, Jindong Chen, Eric Xing, Zhiting Hu,
Abstract summary: We present RedCoast, a tool crafted to automate distributed training and inference for large language models (LLMs) We also propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions. As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
Score: 32.01139974519813
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model to distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing model parallel tools, such as Megatron-LM, DeepSpeed, and Alpa. These tools require users' expertise in machine learning systems (MLSys), creating a bottleneck in LLM development, particularly for developers without MLSys background. In this work, we present RedCoast (Redco), a lightweight and user-friendly tool crafted to automate distributed training and inference for LLMs, as well as to simplify ML pipeline development. The design of Redco emphasizes two key aspects. Firstly, to automate model parallelism, our study identifies two straightforward rules to generate tensor parallel strategies for any given LLM. Integrating these rules into Redco facilitates effortless distributed LLM training and inference, eliminating the need of additional coding or complex configurations. We demonstrate the effectiveness by applying Redco on a set of LLM architectures, such as GPT-J, LLaMA, T5, and OPT, up to the size of 66B. Secondly, we propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions, avoiding redundant and formulaic code like multi-host related processing. This mechanism proves adaptable across a spectrum of ML algorithms, from foundational language modeling to complex algorithms like meta-learning and reinforcement learning. As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.

Related papers

PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing [48.30406812516552]
We introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimize model architecture and edge system constraints. PLM employs a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint. evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data.
arXiv Detail & Related papers (2025-03-15T15:11:17Z)
Adaptive Self-improvement LLM Agentic System for ML Library Development [8.766639641127412]
Large language models (LLMs) have shown general coding capabilities. LLMs need complex reasoning with limited data in order to complete this task. We introduce an adaptive self-improvement agentic system to generate ASPL code with both open and closed-source LLMs.
arXiv Detail & Related papers (2025-02-04T17:57:17Z)
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM. DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM. We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget. We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM. We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
Rome was Not Built in a Single Step: Hierarchical Prompting for LLM-based Chip Design [22.70660876673987]
Large Language Models (LLMs) are effective in computer hardware synthesis via hardware description language (HDL) generation. However, LLM-assisted approaches for HDL generation struggle when handling complex tasks. We introduce a suite of hierarchical prompting techniques which facilitate efficient stepwise design methods.
arXiv Detail & Related papers (2024-07-23T21:18:31Z)
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models [3.4070166891274263]
Large language models (LLMs) solve natural language processing, complex reasoning, sentiment analysis and other tasks. These abilities come with very high memory and computational costs which precludes the use of LLMs on most hardware platforms. We propose an effective method of finding Pareto-optimal network architectures based on LLaMA2-7B using one-shot NAS. We show that, for certain standard benchmark tasks, the pre-trained LLaMA2-7B network is unnecessarily large and complex.
arXiv Detail & Related papers (2024-05-28T17:20:44Z)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size. These models require high-end hardware, making them inaccessible to most researchers. We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs) It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models [11.845239346943067]
parameter-efficient fine-tuning (PEFT) is a promising approach to efficiently specialize large language models (LLMs) to task-specific data. Our study highlights the potential for tuning larger LLMs and significant reductions in memory usage by combining PEFT with quantization.
arXiv Detail & Related papers (2023-08-21T04:31:06Z)
LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices. We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
MLGO: a Machine Learning Guided Compiler Optimizations Framework [0.0]
This work is the first full integration of machine learning in a complex compiler pass in a real-world setting. We use two different ML algorithms to train the inlining-for-size model, and achieve up to 7% size reduction. The same model generalizes well to a diversity of real-world targets, as well as to the same set of targets after months of active development.
arXiv Detail & Related papers (2021-01-13T00:02:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.