Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
- URL: http://arxiv.org/abs/2010.09063v2
- Date: Tue, 26 Oct 2021 19:54:51 GMT
- Title: Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
- Authors: Pranav Subramani, Nicholas Vadivelu, Gautam Kamath
- Abstract summary: A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD).
We demonstrate that by exploiting powerful language primitives, one can dramatically reduce these overheads, in many cases nearly matching the best non-private running times.
- Score: 8.404254529115835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common pain point in differentially private machine learning is the
significant runtime overhead incurred when executing Differentially Private
Stochastic Gradient Descent (DPSGD), which may be as large as two orders of
magnitude. We thoroughly demonstrate that by exploiting powerful language
primitives, including vectorization, just-in-time compilation, and static graph
optimization, one can dramatically reduce these overheads, in many cases nearly
matching the best non-private running times. These gains are realized in two
frameworks: JAX and TensorFlow. JAX provides rich support for these primitives
as core features of the language through the XLA compiler. We also rebuild core
parts of TensorFlow Privacy, integrating features from TensorFlow 2 as well as
XLA compilation, granting significant memory and runtime improvements over the
current release version. These approaches allow us to achieve up to 50x
speedups in comparison to the best alternatives. Our code is available at
https://github.com/TheSalon/fast-dpsgd.
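The recipe the abstract describes (vectorized per-example gradients plus JIT compilation of the whole private update) can be sketched in a few lines of JAX. The sketch below is a minimal illustration under assumed names, not the repository's code: the linear model, loss_fn, and all hyperparameter values are placeholders.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Placeholder linear model; stands in for any per-example loss.
    pred = jnp.dot(x, params)
    return (pred - y) ** 2

# Vectorization: jax.vmap computes every per-example gradient in one
# batched call, replacing the Python loop that makes naive DPSGD slow.
per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

@jax.jit  # JIT compilation: XLA fuses clip + noise + update into one graph
def dpsgd_step(params, x, y, key, lr=0.1, l2_clip=1.0, noise_mult=1.1):
    grads = per_example_grads(params, x, y)              # (batch, dim)
    norms = jnp.linalg.norm(grads, axis=1)               # per-example L2 norms
    scale = jnp.minimum(1.0, l2_clip / (norms + 1e-12))  # clipping factors
    clipped_sum = jnp.sum(grads * scale[:, None], axis=0)
    noise = noise_mult * l2_clip * jax.random.normal(key, params.shape)
    return params - lr * (clipped_sum + noise) / x.shape[0]

# Toy usage: one private step on random data.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 8))
y = x @ jnp.ones(8)
params = dpsgd_step(jnp.zeros(8), x, y, key)
```

The point of the pattern is that vmap turns the per-example gradient loop into a single batched computation, and jit hands the clip-sum-noise-update pipeline to XLA as one compiled graph, which is where the reported speedups come from.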
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management (see the sketch below).
Our experiments show more than a 12x improvement in runtime compared to the Hugging Face/DeepSpeed implementation on four GPUs, while consuming less than half the VRAM per GPU.
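JORA's actual API is not reproduced here; as a rough sketch of the JAX primitives the summary names (JIT compilation plus tensor sharding), the snippet below shards a weight matrix across whatever devices are visible and runs a jitted matmul over it. All shapes and names are illustrative assumptions.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# 1-D device mesh over all visible accelerators (a single CPU also works).
mesh = Mesh(np.array(jax.devices()), axis_names=("dev",))

# Place a weight matrix column-sharded across the mesh.
w = jax.device_put(jnp.ones((512, 512)),
                   NamedSharding(mesh, PartitionSpec(None, "dev")))

@jax.jit  # XLA compiles the matmul and inserts any needed collectives
def forward(x, w):
    return x @ w

y = forward(jnp.ones((4, 512)), w)  # output stays sharded along columns
```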
arXiv Detail & Related papers (2024-03-17T23:02:04Z)
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for deploying models across different runtime infrastructures.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
- JaxMARL: Multi-Agent RL Environments and Algorithms in JAX [105.343918678781]
We present JaxMARL, the first open-source, Python-based library that combines GPU-enabled efficiency with support for a large number of commonly used MARL environments.
Our experiments show that, in terms of wall clock time, our JAX-based training pipeline is around 14 times faster than existing approaches.
We also introduce and benchmark SMAX, a JAX-based approximate reimplementation of the popular StarCraft Multi-Agent Challenge.
arXiv Detail & Related papers (2023-11-16T18:58:43Z)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket [62.955793932377524]
We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9x faster than PyTorch.
arXiv Detail & Related papers (2023-10-26T19:46:11Z)
- CHERI Performance Enhancement for a Bytecode Interpreter [0.0]
We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits).
The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
arXiv Detail & Related papers (2023-08-09T17:12:23Z)
- PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR [10.059491353103526]
We propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators.
IntelliGen considers both computation and data movement optimizations.
We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively.
arXiv Detail & Related papers (2023-07-11T03:17:40Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.