ALTO: An Efficient Network Orchestrator for Compound AI Systems
- URL: http://arxiv.org/abs/2403.04311v1
- Date: Thu, 7 Mar 2024 08:30:26 GMT
- Title: ALTO: An Efficient Network Orchestrator for Compound AI Systems
- Authors: Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas
Venkatesh, Neha Kunjal, Pratiksha Thaker, Philip Levis, Matei Zaharia
- Abstract summary: ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible.
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
- Score: 20.880866765513066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ALTO, a network orchestrator for efficiently serving compound AI
systems such as pipelines of language models. ALTO achieves high throughput and
low latency by taking advantage of an optimization opportunity specific to
generative language models: streaming intermediate outputs. As language models
produce outputs token by token, ALTO exposes opportunities to stream
intermediate outputs between stages when possible. We highlight two new
challenges of correctness and load balancing which emerge when streaming
intermediate data across distributed pipeline stage instances. We also motivate
the need for an aggregation-aware routing interface and distributed
prompt-aware scheduling to address these challenges. We demonstrate the impact
of ALTO's partial output streaming on a complex chatbot verification pipeline,
increasing throughput by up to 3x for a fixed latency target of 4 seconds /
request while also reducing tail latency by 1.8x compared to a baseline serving
approach.
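The abstract describes streaming intermediate token outputs between pipeline stages but includes no code. As a rough illustration only, here is a minimal asyncio sketch of that idea; the stage names and token logic are hypothetical stand-ins, not ALTO's actual interface.

```python
# Minimal sketch of the streaming idea (not ALTO's actual API): a downstream
# stage consumes tokens as the upstream "model" emits them, so the two stages
# overlap instead of running back to back. All names here are illustrative.
import asyncio
from typing import AsyncIterator

async def generator_stage(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a language model that decodes token by token.
    for token in prompt.split():
        await asyncio.sleep(0.01)  # simulated per-token decode latency
        yield token

async def verifier_stage(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    # Starts work on each token as soon as it arrives, rather than
    # blocking until the upstream stage has finished its whole output.
    async for token in tokens:
        yield f"checked({token})"  # placeholder for real verification work

async def main() -> None:
    stream = verifier_stage(generator_stage("stream intermediate outputs"))
    async for result in stream:
        print(result)

asyncio.run(main())
```

Because the verifier begins as soon as the first token arrives, the pipeline's end-to-end latency approaches the slower stage's latency rather than the sum of both, which is the opportunity the abstract highlights.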
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
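The summary above names adaptive compression but does not specify the scheme. As one common instance of gradient compression for bandwidth-limited, geo-distributed training, here is a hedged top-k sparsification sketch; `compress_topk` and its `ratio` parameter are illustrative, not FusionLLM's API.

```python
# Generic top-k gradient sparsification sketch (not FusionLLM's actual
# compression): keep only the largest-magnitude gradient entries and
# transmit (indices, values) instead of the dense tensor.
import numpy as np

def compress_topk(grad: np.ndarray, ratio: float = 0.01):
    # Select the k largest-magnitude entries of the flattened gradient.
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], grad.shape

def decompress_topk(idx, values, shape):
    # Scatter the received values back into a dense zero tensor.
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals, shape = compress_topk(grad, ratio=0.01)
approx = decompress_topk(idx, vals, shape)
print(f"sent {idx.size} of {grad.size} entries")
```

An adaptive variant in the spirit of the title would tune `ratio` online to the measured link bandwidth between sites.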
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning [59.88924847995279]
We propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for multivariate time series forecasting (MTSF).
To reduce the distribution discrepancy, we develop the cross-modal match module.
CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks.
arXiv Detail & Related papers (2024-03-12T04:04:38Z)
- ALERT-Transformer: Bridging Asynchronous and Synchronous Machine Learning for Real-Time Event-based Spatio-Temporal Data [8.660721666999718]
We propose a hybrid pipeline composed of asynchronous sensing and synchronous processing.
We achieve state-of-the-art performance with lower latency than competitors.
arXiv Detail & Related papers (2024-02-02T13:17:19Z)
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving [10.926767319124547]
We present Apparate, a system that automatically applies and manages early exits in machine learning models.
To cope with the time-varying overhead and accuracy challenges that early exits (EEs) bring, Apparate repurposes exits to provide continual feedback.
Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP classification workloads.
arXiv Detail & Related papers (2023-12-08T21:49:09Z)
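Apparate's exit machinery is not detailed in the summary above; the generic early-exit sketch below only illustrates the underlying mechanism: intermediate heads return a prediction once confidence clears a threshold, and the observed confidences are the kind of signal an adaptive controller could consume as feedback. All layers and heads here are toy stand-ins.

```python
# Generic early-exit sketch (not Apparate's implementation): intermediate
# heads predict early once confidence clears a threshold, skipping the
# remaining layers; per-exit confidences double as a feedback signal.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_forward(x, layers, heads, threshold=0.9):
    feedback = []  # per-exit confidences, usable for tuning thresholds
    for depth, (layer, head) in enumerate(zip(layers, heads)):
        x = layer(x)
        probs = softmax(head(x))
        conf = float(probs.max())
        feedback.append(conf)
        if conf >= threshold:  # confident enough: exit early
            return int(probs.argmax()), depth, feedback
    return int(probs.argmax()), len(layers) - 1, feedback

# Toy stand-ins: random layers and exit heads over a 16-dim input.
rng = np.random.default_rng(0)
layers = [lambda x, W=rng.standard_normal((16, 16)): np.tanh(x @ W) for _ in range(4)]
heads = [lambda x, W=rng.standard_normal((16, 3)): x @ W for _ in range(4)]
label, exit_depth, feedback = early_exit_forward(rng.standard_normal(16), layers, heads)
print(label, exit_depth, feedback)
```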
- Turbocharge Speech Understanding with Pilot Inference [0.9699101045941684]
This paper sets out to accelerate modern speech understanding on resource-constrained edge devices.
It takes a hybrid approach: speeding up on-device execution and offloading inputs that exceed the device's capacity.
Our prototype, called PASU, is tested on Arm platforms with 6-8 cores: it delivers SOTA accuracy while reducing end-to-end latency by 2x and offloading needs by 2x.
arXiv Detail & Related papers (2023-11-22T17:14:18Z)
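The summary above gestures at a hybrid execute-or-offload policy without spelling it out. The sketch below shows one plausible shape of such a policy, assuming a hypothetical confidence threshold and stub models; it is not PASU's actual design.

```python
# Hypothetical execute-or-offload sketch in the spirit of the summary (not
# PASU's actual design): run a cheap on-device model first, and fall back
# to a remote model only when local confidence is low.
from dataclasses import dataclass

@dataclass
class Result:
    text: str
    confidence: float

def local_asr(audio: bytes) -> Result:
    # Stand-in for a small on-device speech model.
    return Result(text="turn on the lights", confidence=0.72)

def remote_asr(audio: bytes) -> Result:
    # Stand-in for a large server-side model (network round trip implied).
    return Result(text="turn on the lights", confidence=0.98)

def transcribe(audio: bytes, offload_threshold: float = 0.8) -> Result:
    local = local_asr(audio)
    if local.confidence >= offload_threshold:
        return local          # fast path: stay on device
    return remote_asr(audio)  # slow path: input exceeds device capacity

print(transcribe(b"\x00" * 16000))
```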
- Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
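The summary above describes token-granularity preemption but not FastServe's actual scheduling policy. The sketch below illustrates just that property with a simple least-attained-service queue; the names and policy are illustrative, not FastServe's algorithm. Because autoregressive decoding yields control after every token, the server can reschedule between tokens.

```python
# Simplified sketch of token-granularity preemption (not FastServe's actual
# scheduler): decode one token for the request with the least service so
# far, then preempt and reschedule. Short requests escape long queues.
import heapq

def serve(requests):
    # Each entry: (tokens_generated_so_far, request_id, tokens_remaining)
    queue = [(0, rid, remaining) for rid, remaining in requests]
    heapq.heapify(queue)
    order = []
    while queue:
        served, rid, remaining = heapq.heappop(queue)
        order.append(rid)              # decode exactly one token for rid
        if remaining > 1:              # then preempt and reschedule
            heapq.heappush(queue, (served + 1, rid, remaining - 1))
    return order

# A short request ("B") finishes quickly instead of waiting behind a long
# one ("A"), which is what cuts average and tail latency.
print(serve([("A", 5), ("B", 2)]))
```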
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and early-exit processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- Synthetic Datasets for Neural Program Synthesis [66.20924952964117]
We propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications.
We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
arXiv Detail & Related papers (2019-12-27T21:28:10Z)