Petals: Collaborative Inference and Fine-tuning of Large Models
- URL: http://arxiv.org/abs/2209.01188v1
- Date: Fri, 2 Sep 2022 17:38:03 GMT
- Title: Petals: Collaborative Inference and Fine-tuning of Large Models
- Authors: Alexander Borzunov, Dmitry Baranchuk, Tim Dettmers, Max Ryabinin,
Younes Belkada, Artem Chumachenko, Pavel Samygin, Colin Raffel
- Abstract summary: Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties.
- Score: 78.37798144357977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP tasks benefit from using large language models (LLMs) that often
have more than 100 billion parameters. With the release of BLOOM-176B and
OPT-175B, everyone can download pretrained models of this scale. Still, using
these models requires high-end hardware unavailable to many researchers. In
some cases, LLMs can be used more affordably via RAM offloading or hosted APIs.
However, these techniques have innate limitations: offloading is too slow for
interactive inference, while APIs are not flexible enough for research. In this
work, we propose Petals, a system for inference and fine-tuning of large
models that joins the resources of multiple parties trusted to process
clients' data. We demonstrate that this strategy significantly outperforms
offloading for very large models, running inference of BLOOM-176B on consumer
GPUs at ≈ 1 step per second. Unlike most inference APIs,
Petals also natively exposes the hidden states of served models, allowing its
users to train and share custom model extensions based on efficient fine-tuning
methods.
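For orientation, here is a minimal sketch of what a Petals client session can look like, based on the publicly released petals Python package (the class name and checkpoint are assumptions and may differ across package versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM  # assumption: petals 2.x-style API

MODEL_NAME = "bigscience/bloom"  # assumption: any checkpoint served by a Petals swarm

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Connects to a swarm of peers, each serving a contiguous slice of the model's blocks.
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
# Each generation step routes activations through the swarm rather than local layers.
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```

Because the client sees the hidden states passed between served blocks, parameter-efficient extensions (e.g., trainable prompts or adapters) can be optimized locally against the remote model, which is the fine-tuning path the abstract describes.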
Related papers
- Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z)
- Improving Large Models with Small models: Lower Costs and Better Performance [81.55672406002715]
We propose Data Shunt+ (DS+), a general paradigm for collaboration between small and large models (see the sketch after this entry).
For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, while DS+ achieves 95.64% at only 31.18% of the cost.
arXiv Detail & Related papers (2024-06-15T14:44:43Z)
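A minimal sketch of the data-shunting idea, assuming a confidence-threshold router; the threshold, the (label, confidence) interface, and both stand-in models are hypothetical, not the DS+ implementation:

```python
def data_shunt(query, small_model, large_model, threshold=0.9):
    """Route easy inputs to the cheap small model; escalate hard ones."""
    label, confidence = small_model(query)  # hypothetical (label, confidence) interface
    if confidence >= threshold:
        return label                         # cheap path: small model is confident
    return large_model(query)                # expensive fallback to the large model

# Toy usage with stand-in models:
small = lambda q: ("positive", 0.95 if "great" in q else 0.5)
large = lambda q: "positive" if "great" in q or "good" in q else "negative"
print(data_shunt("great battery life", small, large))   # answered by the small model
print(data_shunt("it is fine, I guess", small, large))  # escalated to the large model
```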
- Fast Inference of Mixture-of-Experts Language Models with Offloading [0.7998559449733824]
We study the problem of running large MoE language models on consumer hardware with limited accelerator memory.
Using this strategy, we build a system that can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances (a sketch of the offloading pattern follows this entry).
arXiv Detail & Related papers (2023-12-28T18:58:13Z)
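A rough sketch of the offloading pattern such systems build on: keep a small LRU cache of experts in accelerator memory and fetch the rest from host RAM on demand. All names and the cache policy here are illustrative assumptions, not the paper's code:

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class ExpertCache:
    """Keep at most `capacity` experts on the accelerator; evict least recently used."""
    def __init__(self, cpu_experts, capacity=4):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.cpu = dict(cpu_experts)  # expert_id -> module kept in host RAM
        self.hot = OrderedDict()      # expert_id -> module resident on the accelerator
        self.capacity = capacity

    def get(self, expert_id):
        if expert_id in self.hot:
            self.hot.move_to_end(expert_id)      # mark as recently used
            return self.hot[expert_id]
        if len(self.hot) >= self.capacity:       # cache full: evict the coldest expert
            old_id, old = self.hot.popitem(last=False)
            self.cpu[old_id] = old.to("cpu")
        expert = self.cpu.pop(expert_id).to(self.device)
        self.hot[expert_id] = expert
        return expert

# Toy usage: 8 feed-forward "experts", at most 4 resident at a time.
cache = ExpertCache({i: nn.Linear(16, 16) for i in range(8)})
y = cache.get(3)(torch.randn(1, 16, device=cache.device))
```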
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize total system throughput (a toy balancing rule is sketched below).
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
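A toy version of a throughput-maximizing balancing rule, assuming servers announce a speed and pick a pipeline stage when joining; the function and data layout are invented for illustration, and the paper's actual protocol is more involved:

```python
def choose_stage(stage_throughput, server_speed):
    """A joining server picks the stage that currently bottlenecks the pipeline.

    `stage_throughput` maps a stage (a contiguous range of model blocks) to the
    summed speed of servers already holding it; raising the minimum raises the
    throughput of the whole pipeline.
    """
    bottleneck = min(stage_throughput, key=stage_throughput.get)
    stage_throughput[bottleneck] += server_speed
    return bottleneck

# Toy usage: three pipeline stages, two servers join one after another.
throughput = {"blocks 0-23": 4.0, "blocks 24-47": 1.0, "blocks 48-69": 3.0}
print(choose_stage(throughput, 2.0))  # -> 'blocks 24-47' (the bottleneck)
print(choose_stage(throughput, 2.0))  # -> 'blocks 24-47' again (tied lowest at 3.0)
```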
- Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer [1.3108652488669732]
We show that a herd of open source models can match or exceed the performance of proprietary models via an intelligent router.
In cases where GPT cannot answer a query, Herd identifies a model that can at least 40% of the time.
arXiv Detail & Related papers (2023-10-30T18:11:02Z)
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at budgets of just $187 and $800, respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving [53.01646445659089]
We show that model parallelism can be used for the statistical multiplexing of multiple devices when serving multiple models.
We present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models (a toy illustration of the multiplexing argument follows this entry).
arXiv Detail & Related papers (2023-02-22T21:41:34Z)
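A toy calculation of the multiplexing argument, with invented numbers: if two models burst at different times, sharding both across both GPUs serves each burst at full cluster speed, whereas dedicated placement caps a burst at a single GPU:

```python
# Invented numbers for illustration only.
burst_demand = 2.0        # GPUs' worth of compute a model needs during a burst
dedicated_capacity = 1.0  # each model pinned to its own GPU
shared_capacity = 2.0     # both models model-parallelized across both GPUs

print(f"dedicated: burst runs at {dedicated_capacity / burst_demand:.0%} of demand")
print(f"multiplexed: burst runs at {shared_capacity / burst_demand:.0%} of demand")
# As long as the two models' bursts rarely overlap, the multiplexed placement
# gives each burst the whole cluster without extra hardware.
```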
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters at constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies k top-1 routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models (a toy sketch follows this entry).
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
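A toy sketch of expert prototyping as summarized above: experts are split into k prototypes, each prototype applies its own top-1 routing, and the winning outputs are combined. Sizes and the naive per-token dispatch loop are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    """Split `num_experts` experts into `k` prototypes; top-1 route within each."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        assert num_experts % k == 0
        self.k, self.group = k, num_experts // k
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gates = nn.ModuleList(nn.Linear(dim, self.group) for _ in range(k))

    def forward(self, x):                      # x: (batch, dim)
        out = torch.zeros_like(x)
        for i, gate in enumerate(self.gates):  # one independent top-1 routing per prototype
            scores = gate(x).softmax(dim=-1)   # (batch, group)
            weight, idx = scores.max(dim=-1)   # winning expert per token in this prototype
            for t in range(x.size(0)):         # naive per-token dispatch, clarity over speed
                expert = self.experts[i * self.group + idx[t].item()]
                out[t] += weight[t] * expert(x[t])
        return out

# k prototypes of top-1 routing keep computation roughly constant while, per the
# abstract, improving quality relative to a single routing over all experts.
moe = PrototypedMoE(dim=16)
y = moe(torch.randn(4, 16))
```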