NumS: Scalable Array Programming for the Cloud
- URL: http://arxiv.org/abs/2206.14276v1
- Date: Tue, 28 Jun 2022 20:13:40 GMT
- Title: NumS: Scalable Array Programming for the Cloud
- Authors: Melih Elibol, Vinamra Benara, Samyu Yagati, Lianmin Zheng, Alvin
Cheung, Michael I. Jordan, Ion Stoica
- Abstract summary: We present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
- Score: 82.827921577004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientists increasingly rely on Python tools to perform scalable distributed
memory array operations using rich, NumPy-like expressions. However, many of
these tools rely on dynamic schedulers optimized for abstract task graphs,
which often encounter memory and network bandwidth-related bottlenecks due to
sub-optimal data and operator placement decisions. Tools built on the message
passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling
properties, but these solutions require specialized knowledge to use. In this
work, we present NumS, an array programming library which optimizes NumPy-like
expressions on task-based distributed systems. This is achieved through a novel
scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local
search method which optimizes operator placement by minimizing maximum memory
and network load on any given node within a distributed system. Coupled with a
heuristic for load-balanced data layouts, our approach is capable of attaining
communication lower bounds on some common numerical operations, and our
empirical study shows that LSHS enhances performance on Ray by decreasing
network load by 2x, requiring 4x less memory, and reducing
execution time by 10x on the logistic regression problem. On terabyte-scale
data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x
speedup over Dask on a key operation for tensor factorization, and a 2x speedup
on logistic regression compared to Dask ML and Spark's MLlib.
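To ground the two main ideas, here are two short sketches. First, a usage sketch in the style of NumS's NumPy-like API; the `nums.numpy` module and `.get()` materialization follow the project's README, but treat the exact names as assumptions:

```python
# Hedged usage sketch of NumS-style distributed array programming.
# Module and method names follow the NumS README but are not verified here.
import nums.numpy as nps

X = nps.random.rand(10**6, 128)  # block-partitioned across cluster memory
w = nps.random.rand(128)
y = X @ w                        # dispatched as distributed block operations
print(y.get().shape)             # .get() pulls the result to the driver
```

Second, a minimal sketch of the load-simulated placement idea behind LSHS, not the NumS implementation: the `Node`/`Op` structures and the additive cost over maximum memory and network load are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    mem_load: float = 0.0  # simulated bytes resident on this node
    net_load: float = 0.0  # simulated bytes transferred through this node

@dataclass
class Op:
    out_bytes: float  # size of the operator's output block
    inputs: list      # (node_name, bytes) for each input block

def place(op: Op, nodes: list) -> Node:
    """Greedy local search: simulate running `op` on each candidate node
    and pick the placement that minimizes the cluster's maximum loads."""
    best, best_cost = None, float("inf")
    for cand in nodes:
        # Inputs not already resident on `cand` must cross the network.
        moved = sum(b for loc, b in op.inputs if loc != cand.name)
        mems = [n.mem_load + (op.out_bytes if n is cand else 0.0) for n in nodes]
        nets = [n.net_load + (moved if n is cand else 0.0) for n in nodes]
        cost = max(mems) + max(nets)  # penalize the most-loaded node
        if cost < best_cost:
            best, best_cost = cand, cost
    # Commit the simulated loads on the chosen node.
    best.mem_load += op.out_bytes
    best.net_load += sum(b for loc, b in op.inputs if loc != best.name)
    return best

# Example: two nodes, one binary op with an input block on each node.
nodes = [Node("n0"), Node("n1")]
op = Op(out_bytes=8.0, inputs=[("n0", 8.0), ("n1", 8.0)])
print(place(op, nodes).name)
```

Minimizing the maximum per-node load, rather than the total, is what steers placements away from the memory and bandwidth hotspots that the abstract attributes to generic dynamic schedulers.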
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup of up to 8.6x on CPUs for sparse-quantized LLaMA models.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System [21.09681871279162]
Modern Machine Learning (ML) training on large-scale datasets is a time-consuming workload.
It relies on the Stochastic Gradient Descent (SGD) optimization algorithm due to its effectiveness, simplicity, and generalization performance.
Processor-centric architectures suffer from low performance and high energy consumption when executing ML training workloads.
Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck.
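For reference, the SGD update at the heart of such workloads is compact; this is a generic NumPy sketch of one mini-batch step, not code from the PIM-Opt study:

```python
# Illustrative mini-batch SGD step for a least-squares model in NumPy.
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.01):
    """One SGD update: w <- w - lr * gradient of the mini-batch loss."""
    residual = X_batch @ w - y_batch            # shape (batch,)
    grad = X_batch.T @ residual / len(y_batch)  # shape (features,)
    return w - lr * grad
```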
arXiv Detail & Related papers (2024-04-10T17:00:04Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
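As a rough illustration of such a dense-to-sparse transformation (the projection matrix, ReLU, and top-k pruning here are generic assumptions, not the paper's exact method):

```python
# Hedged sketch: map a frozen dense embedding to a sparse lexical vector by
# projecting onto vocabulary logits and keeping the strongest positive terms.
import numpy as np

def to_sparse_lexical(dense_vec, vocab_proj, k=64):
    """dense_vec: (d,); vocab_proj: (d, vocab_size) learned projection.
    Returns (term_indices, term_weights) of a sparse lexical vector."""
    logits = np.maximum(dense_vec @ vocab_proj, 0.0)  # ReLU: non-negative weights
    top = np.argsort(logits)[-k:]                     # k highest-scoring terms
    top = top[logits[top] > 0]                        # drop zeroed-out terms
    return top, logits[top]
```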
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms state-of-the-art methods with up to a 10% decrease in latency constraint violation rate and nearly a 4x reduction in average normalized turnaround time.
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z)
- A Theory of I/O-Efficient Sparse Neural Network Inference [17.862408781750126]
As machine learning models improve in accuracy at a rapid pace, their demand for energy and compute resources grows correspondingly.
At a low level, most of these resources are consumed by data movement between different memory units.
We provide a rigorous theoretical analysis of the I/Os needed in sparse feedforward neural network (FFNN) inference.
arXiv Detail & Related papers (2023-01-03T11:23:46Z)
- Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution [11.336229510791481]
We discuss a novel execution paradigm for microcontroller deep learning.
It modifies the execution of neural networks to avoid materialising full buffers in memory.
This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time.
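A minimal sketch of this execution style, with illustrative names and chunk size rather than the Pex API:

```python
# Hedged sketch of partial execution: stream an elementwise operator chain
# chunk by chunk instead of materializing the full intermediate buffer.
import numpy as np

def relu_then_scale(x, out, chunk=1024):
    """Equivalent to out[:] = np.maximum(x, 0) * 2.0, but at most `chunk`
    intermediate elements are alive at any moment."""
    for i in range(0, len(x), chunk):
        partial = np.maximum(x[i:i + chunk], 0.0)  # fraction of the intermediate
        out[i:i + chunk] = partial * 2.0
    return out
```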
arXiv Detail & Related papers (2022-11-30T18:47:30Z)
- Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning [40.09527159285327]
We build the first end-to-end, general-purpose system, called Walle, for device-cloud collaborative machine learning (ML).
Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing a cross-platform and high-performance execution environment.
We evaluate Walle in practical e-commerce application scenarios to demonstrate its effectiveness, efficiency, and scalability.
arXiv Detail & Related papers (2022-05-30T03:43:35Z)
- HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics [0.0]
HeAT is an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API.
HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI.
When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.
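A small usage sketch in the style of HeAT's NumPy-like API, where the `split` argument chooses the axis distributed over MPI ranks (call names are best-effort assumptions from the project's documentation):

```python
# Hedged HeAT usage sketch; run under MPI, e.g. `mpirun -n 4 python demo.py`.
import heat as ht

x = ht.random.randn(10000, 100, split=0)  # rows sharded across MPI ranks
col_mean = ht.mean(x, axis=0)             # reduction across the split axis
print(col_mean.shape)                     # (100,)
```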
arXiv Detail & Related papers (2020-07-27T13:33:17Z)