Petals: Collaborative Inference and Fine-tuning of Large Models
- URL: http://arxiv.org/abs/2209.01188v1
- Date: Fri, 2 Sep 2022 17:38:03 GMT
- Title: Petals: Collaborative Inference and Fine-tuning of Large Models
- Authors: Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin,
Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel
- Abstract summary: Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties.
- Score: 78.37798144357977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP tasks benefit from using large language models (LLMs) that often
have more than 100 billion parameters. With the release of BLOOM-176B and
OPT-175B, everyone can download pretrained models of this scale. Still, using
these models requires high-end hardware unavailable to many researchers. In
some cases, LLMs can be used more affordably via RAM offloading or hosted APIs.
However, these techniques have innate limitations: offloading is too slow for
interactive inference, while APIs are not flexible enough for research. In this
work, we propose Petals, a system for inference and fine-tuning of large
models that joins the resources of multiple parties trusted to process
clients' data. We demonstrate that this strategy significantly outperforms
offloading for very large models, running inference of BLOOM-176B on consumer
GPUs at ≈ 1 step per second. Unlike most inference APIs,
Petals also natively exposes the hidden states of served models, allowing its
users to train and share custom model extensions based on efficient fine-tuning
methods.
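For orientation, here is a minimal sketch of what a Petals client session can look like, based on the publicly released petals Python package (the class name and checkpoint are assumptions and may differ across package versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM  # assumption: petals 2.x-style API

MODEL_NAME = "bigscience/bloom"  # assumption: any checkpoint served by a Petals swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Connects to a swarm of peers, each serving a contiguous slice of the model's blocks.
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
# Each generation step routes activations through the swarm rather than local layers.
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```

Because the client sees the hidden states passed between served blocks, parameter-efficient extensions (e.g., trainable prompts or adapters) can be optimized locally against the remote model, which is the fine-tuning path the abstract describes.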
Related papers
- Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z)
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models (see the sketch after this entry).
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS+ achieves 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
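A minimal sketch of the data-shunting idea, assuming a confidence-threshold router; the threshold, the (label, confidence) interface, and both stand-in models are hypothetical, not the DS+ implementation:

```python
def data_shunt(query, small_model, large_model, threshold=0.9):
    """Route easy inputs to the cheap small model; escalate hard ones."""
    label, confidence = small_model(query)  # hypothetical (label, confidence) interface
    if confidence >= threshold:
        return label                         # cheap path: small model is confident
    return large_model(query)                # expensive fallback to the large model

# Toy usage with stand-in models:
small = lambda q: ("positive", 0.95 if "great" in q else 0.5)
large = lambda q: "positive" if "great" in q or "good" in q else "negative"
print(data_shunt("great battery life", small, large))   # answered by the small model
print(data_shunt("it is fine, I guess", small, large))  # escalated to the large model
```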
- Fast Inference of Mixture-of-Experts Language Models with Offloading [0.7998559449733824]
We study the problem of running large MoE language models on consumer hardware with limited accelerator memory.
Using this strategy, we build a system that can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances (a sketch of the offloading pattern follows this entry).
arXiv Detail & Related papers (2023-12-28T18:58:13Z)
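A rough sketch of the offloading pattern such systems build on: keep a small LRU cache of experts in accelerator memory and fetch the rest from host RAM on demand. All names and the cache policy here are illustrative assumptions, not the paper's code:

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class ExpertCache:
    """Keep at most `capacity` experts on the accelerator; evict least recently used."""
    def __init__(self, cpu_experts, capacity=4):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.cpu = dict(cpu_experts)  # expert_id -> module kept in host RAM
        self.hot = OrderedDict()      # expert_id -> module resident on the accelerator
        self.capacity = capacity

    def get(self, expert_id):
        if expert_id in self.hot:
            self.hot.move_to_end(expert_id)      # mark as recently used
            return self.hot[expert_id]
        if len(self.hot) >= self.capacity:       # cache full: evict the coldest expert
            old_id, old = self.hot.popitem(last=False)
            self.cpu[old_id] = old.to("cpu")
        expert = self.cpu.pop(expert_id).to(self.device)
        self.hot[expert_id] = expert
        return expert

# Toy usage: 8 feed-forward "experts", at most 4 resident at a time.
cache = ExpertCache({i: nn.Linear(16, 16) for i in range(8)})
y = cache.get(3)(torch.randn(1, 16, device=cache.device))
```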
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize total system throughput (a toy balancing rule is sketched below).
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
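A toy version of a throughput-maximizing balancing rule, assuming servers announce a speed and pick a pipeline stage when joining; the function and data layout are invented for illustration, and the paper's actual protocol is more involved:

```python
def choose_stage(stage_throughput, server_speed):
    """A joining server picks the stage that currently bottlenecks the pipeline.

    `stage_throughput` maps a stage (a contiguous range of model blocks) to the
    summed speed of servers already holding it; raising the minimum raises the
    throughput of the whole pipeline.
    """
    bottleneck = min(stage_throughput, key=stage_throughput.get)
    stage_throughput[bottleneck] += server_speed
    return bottleneck

# Toy usage: three pipeline stages, two servers join one after another.
throughput = {"blocks 0-23": 4.0, "blocks 24-47": 1.0, "blocks 48-69": 3.0}
print(choose_stage(throughput, 2.0))  # -> 'blocks 24-47' (the bottleneck)
print(choose_stage(throughput, 2.0))  # -> 'blocks 24-47' again (tied lowest at 3.0)
```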
- Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer [1.3108652488669732]
We show that a herd of open source models can match or exceed the performance of proprietary models via an intelligent router.
In cases where GPT cannot answer a query, Herd identifies a model that can at least 40% of the time.
arXiv Detail & Related papers (2023-10-30T18:11:02Z)
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at budgets of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving [53.01646445659089]
We show that model parallelism can be used for the statistical multiplexing of multiple devices when serving multiple models.
We present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models (a toy illustration of the multiplexing argument follows this entry).
arXiv Detail & Related papers (2023-02-22T21:41:34Z)
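A toy calculation of the multiplexing argument, with invented numbers: if two models burst at different times, sharding both across both GPUs serves each burst at full cluster speed, whereas dedicated placement caps a burst at a single GPU:

```python
# Invented numbers for illustration only.
burst_demand = 2.0        # GPUs' worth of compute a model needs during a burst
dedicated_capacity = 1.0  # each model pinned to its own GPU
shared_capacity = 2.0     # both models model-parallelized across both GPUs

print(f"dedicated: burst runs at {dedicated_capacity / burst_demand:.0%} of demand")
print(f"multiplexed: burst runs at {shared_capacity / burst_demand:.0%} of demand")
# As long as the two models' bursts rarely overlap, the multiplexed placement
# gives each burst the whole cluster without extra hardware.
```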
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters at constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies k top-1 routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models (a toy sketch follows this entry).
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
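A toy sketch of expert prototyping as summarized above: experts are split into k prototypes, each prototype applies its own top-1 routing, and the winning outputs are combined. Sizes and the naive per-token dispatch loop are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    """Split `num_experts` experts into `k` prototypes; top-1 route within each."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        assert num_experts % k == 0
        self.k, self.group = k, num_experts // k
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gates = nn.ModuleList(nn.Linear(dim, self.group) for _ in range(k))

    def forward(self, x):                      # x: (batch, dim)
        out = torch.zeros_like(x)
        for i, gate in enumerate(self.gates):  # one independent top-1 routing per prototype
            scores = gate(x).softmax(dim=-1)   # (batch, group)
            weight, idx = scores.max(dim=-1)   # winning expert per token in this prototype
            for t in range(x.size(0)):         # naive per-token dispatch, clarity over speed
                expert = self.experts[i * self.group + idx[t].item()]
                out[t] += weight[t] * expert(x[t])
        return out

# k prototypes of top-1 routing keep computation roughly constant while, per the
# abstract, improving quality relative to a single routing over all experts.
moe = PrototypedMoE(dim=16)
y = moe(torch.randn(4, 16))
```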