Efficient Multi-stage Inference on Tabular Data
- URL: http://arxiv.org/abs/2303.11580v2
- Date: Fri, 21 Jul 2023 19:24:15 GMT
- Title: Efficient Multi-stage Inference on Tabular Data
- Authors: Daniel S Johnson and Igor L Markov
- Abstract summary: Conventional wisdom favors segregating ML code into services queried by product code via RPC APIs.
We simplify inference algorithms and embed them into the product code to reduce network communication.
By applying our optimization with AutoML to both training and inference, we reduce inference latency by 1.3x, CPU resources by 30%, and network communication between application front-end and ML back-end by about 50%.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Many ML applications and products train on medium amounts of input data but
get bottlenecked in real-time inference. When implementing ML systems,
conventional wisdom favors segregating ML code into services queried by product
code via Remote Procedure Call (RPC) APIs. This approach clarifies the overall
software architecture and simplifies product code by abstracting away ML
internals. However, the separation adds network latency and entails additional
CPU overhead. Hence, we simplify inference algorithms and embed them into the
product code to reduce network communication. For public datasets and a
high-performance real-time platform that deals with tabular data, we show that
over half of the inputs are often amenable to such optimization, while the
remainder can be handled by the original model. By applying our optimization
with AutoML to both training and inference, we reduce inference latency by
1.3x, CPU resources by 30%, and network communication between application
front-end and ML back-end by about 50% for a commercial end-to-end ML platform
that serves millions of real-time decisions per second.
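The abstract describes a multi-stage scheme: a simplified model embedded in the product code decides the easy inputs locally, and only ambiguous inputs are sent to the original model over RPC. A minimal sketch of that control flow, assuming a hypothetical `simple_score` stand-in model, a `full_model_rpc` placeholder for the back-end call, and illustrative confidence thresholds (none of these names or values are from the paper):

```python
# Two-stage inference sketch: a cheap embedded model answers confident
# cases locally; ambiguous inputs fall back to the full remote model.
# All names and thresholds here are illustrative, not the paper's API.

def simple_score(features):
    """A tiny, embeddable stand-in model (e.g. a few linear terms)."""
    return 0.8 * features["x1"] + 0.2 * features["x2"]

def full_model_rpc(features):
    """Placeholder for the RPC call to the ML back-end service."""
    return simple_score(features)  # a real system would call a service here

def predict(features, low=0.2, high=0.8):
    score = simple_score(features)
    # Confident local decision: no network round-trip needed.
    if score <= low or score >= high:
        return score, "local"
    # Ambiguous input: defer to the original model over RPC.
    return full_model_rpc(features), "remote"

score, path = predict({"x1": 1.0, "x2": 0.5})
print(path)  # -> "local": score 0.9 clears the 0.8 threshold
```

When more than half of the inputs clear the thresholds, as the paper reports for its datasets, most requests never touch the network, which is where the latency and communication savings come from.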
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication [2.1301190271783317]
We present FSD-Inference, the first fully serverless and highly scalable system for distributed ML inference.
We introduce novel fully serverless communication schemes for ML inference workloads, leveraging both cloud-based publish-subscribe/queueing and object storage offerings.
arXiv Detail & Related papers (2024-03-22T13:31:24Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
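A load-balancing protocol that assigns devices to maximize total system throughput can be illustrated with a greedy rule: each device joins the pipeline stage that is currently the bottleneck, since end-to-end throughput is bounded by the slowest stage. This is a generic sketch, not the paper's actual protocol:

```python
# Greedy load-balancing sketch: each device joins the pipeline stage
# with the lowest aggregate throughput, because the slowest stage bounds
# end-to-end throughput. Illustrative only, not the paper's algorithm.

def assign_devices(device_speeds, num_stages):
    stage_throughput = [0.0] * num_stages
    assignment = {}
    # Place faster devices first so they shore up the worst bottlenecks.
    for dev, speed in sorted(device_speeds.items(), key=lambda kv: -kv[1]):
        bottleneck = min(range(num_stages), key=lambda s: stage_throughput[s])
        assignment[dev] = bottleneck
        stage_throughput[bottleneck] += speed
    # System throughput is limited by the slowest stage.
    return assignment, min(stage_throughput)

assignment, throughput = assign_devices(
    {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}, num_stages=2)
print(throughput)  # -> 5.0 (both stages end up summing to 5.0)
```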
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- Fast Distributed Inference Serving for Large Language Models [12.682341873843882]
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT.
The interactive nature of these applications demands low job completion time (JCT) for model inference.
We present FastServe, a distributed inference serving system for LLMs.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
- MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine Learning Inference [3.1853566662905943]
Multi-party computing (MPC) has been gaining popularity over the past years as a secure computing model.
MPC has fewer overheads than homomorphic encryption (HE) and has a more robust threat model than hardware-based trusted execution environments.
MPC protocols still pay substantial performance penalties compared to plaintext when applied to machine learning algorithms.
arXiv Detail & Related papers (2022-09-27T19:16:26Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end and general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
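Block-coordinate descent, named in the title, updates only one device's block of model coordinates at a time. A minimal serial sketch on a least-squares problem (the paper's algorithm is asynchronous and parallel; this only illustrates the block-update idea):

```python
# Block-coordinate descent sketch for decentralized least squares
# (objective 0.5 * ||A x - b||^2): each user device owns one block of
# coordinates and performs a gradient step only on its own block.
# A serial illustration, not the paper's asynchronous parallel algorithm.
import random

def block_coordinate_descent(A, b, blocks, lr=0.1, rounds=200):
    n = len(A[0])
    x = [0.0] * n
    for _ in range(rounds):
        block = random.choice(blocks)  # one device wakes up
        # residual r = A x - b under the current model
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
             for i in range(len(A))]
        # gradient step restricted to this device's coordinate block
        for j in block:
            grad_j = sum(A[i][j] * r[i] for i in range(len(A)))
            x[j] -= lr * grad_j
    return x

random.seed(0)  # fixed seed so the run is reproducible
x = block_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [2.0, -1.0],
                             blocks=[[0], [1]])
# x converges toward the least-squares solution [2.0, -1.0]
```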
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - Towards Efficient Post-training Quantization of Pre-trained Language
Models [85.68317334241287]
We study post-training quantization(PTQ) of PLMs, and propose module-wise quantization error minimization(MREM), an efficient solution to mitigate these issues.
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
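Post-training quantization replaces trained floating-point weights with low-bit integers without any retraining. A minimal symmetric int8 sketch of the general idea (a generic illustration, not the paper's module-wise MREM procedure):

```python
# Symmetric int8 post-training quantization of a weight vector:
# scale = max|w| / 127, q = round(w / scale), dequantized w ~= q * scale.
# A generic PTQ illustration, not the paper's MREM method.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding bounds the per-weight quantization error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```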
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide a bounded latency for each request to support a consistent service-level objective (SLO).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.