Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
- URL: http://arxiv.org/abs/2010.09063v2
- Date: Tue, 26 Oct 2021 19:54:51 GMT
- Title: Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
- Authors: Pranav Subramani, Nicholas Vadivelu, Gautam Kamath
- Abstract summary: A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD).
We demonstrate that by exploiting powerful language primitives, one can dramatically reduce these overheads, in many cases nearly matching the best non-private running times.
- Score: 8.404254529115835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common pain point in differentially private machine learning is the
significant runtime overhead incurred when executing Differentially Private
Stochastic Gradient Descent (DPSGD), which may be as large as two orders of
magnitude. We thoroughly demonstrate that by exploiting powerful language
primitives, including vectorization, just-in-time compilation, and static graph
optimization, one can dramatically reduce these overheads, in many cases nearly
matching the best non-private running times. These gains are realized in two
frameworks: JAX and TensorFlow. JAX provides rich support for these primitives
as core features of the language through the XLA compiler. We also rebuild core
parts of TensorFlow Privacy, integrating features from TensorFlow 2 as well as
XLA compilation, granting significant memory and runtime improvements over the
current release version. These approaches allow us to achieve up to 50x
speedups in comparison to the best alternatives. Our code is available at
https://github.com/TheSalon/fast-dpsgd.
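The recipe the abstract describes (vectorized per-example gradients plus JIT compilation of the whole private update) can be sketched in a few lines of JAX. The sketch below is a minimal illustration under assumed names, not the repository's code: the linear model, loss_fn, and all hyperparameter values are placeholders.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Placeholder linear model; stands in for any per-example loss.
    pred = jnp.dot(x, params)
    return (pred - y) ** 2

# Vectorization: jax.vmap computes every per-example gradient in one
# batched call, replacing the Python loop that makes naive DPSGD slow.
per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

@jax.jit  # JIT compilation: XLA fuses clip + noise + update into one graph
def dpsgd_step(params, x, y, key, lr=0.1, l2_clip=1.0, noise_mult=1.1):
    grads = per_example_grads(params, x, y)              # (batch, dim)
    norms = jnp.linalg.norm(grads, axis=1)               # per-example L2 norms
    scale = jnp.minimum(1.0, l2_clip / (norms + 1e-12))  # clipping factors
    clipped_sum = jnp.sum(grads * scale[:, None], axis=0)
    noise = noise_mult * l2_clip * jax.random.normal(key, params.shape)
    return params - lr * (clipped_sum + noise) / x.shape[0]

# Toy usage: one private step on random data.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (64, 8))
y = x @ jnp.ones(8)
params = dpsgd_step(jnp.zeros(8), x, y, key)
```

The point of the pattern is that vmap turns the per-example gradient loop into a single batched computation, and jit hands the clip-sum-noise-update pipeline to XLA as one compiled graph, which is where the reported speedups come from.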
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training.
Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management (see the sketch below).
Our experiments show more than a 12x improvement in runtime compared to the Hugging Face/DeepSpeed implementation on four GPUs, while consuming less than half the VRAM per GPU.
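JORA's actual API is not reproduced here; as a rough sketch of the JAX primitives the summary names (JIT compilation plus tensor sharding), the snippet below shards a weight matrix across whatever devices are visible and runs a jitted matmul over it. All shapes and names are illustrative assumptions.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# 1-D device mesh over all visible accelerators (a single CPU also works).
mesh = Mesh(np.array(jax.devices()), axis_names=("dev",))

# Place a weight matrix column-sharded across the mesh.
w = jax.device_put(jnp.ones((512, 512)),
                   NamedSharding(mesh, PartitionSpec(None, "dev")))

@jax.jit  # XLA compiles the matmul and inserts any needed collectives
def forward(x, w):
    return x @ w

y = forward(jnp.ones((4, 512)), w)  # output stays sharded along columns
```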
arXiv Detail & Related papers (2024-03-17T23:02:04Z)
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, which serve as standard formats for deploying models across different runtime infrastructures.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
- JaxMARL: Multi-Agent RL Environments and Algorithms in JAX [105.343918678781]
We present JaxMARL, the first open-source, Python-based library that combines GPU-enabled efficiency with support for a large number of commonly used MARL environments.
Our experiments show that, in terms of wall clock time, our JAX-based training pipeline is around 14 times faster than existing approaches.
We also introduce and benchmark SMAX, a JAX-based approximate reimplementation of the popular StarCraft Multi-Agent Challenge.
arXiv Detail & Related papers (2023-11-16T18:58:43Z)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket [62.955793932377524]
We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9x faster than PyTorch.
arXiv Detail & Related papers (2023-10-26T19:46:11Z)
- CHERI Performance Enhancement for a Bytecode Interpreter [0.0]
We show that it is possible to eliminate certain kinds of software-induced runtime overhead that occur due to the larger size of CHERI capabilities (128 bits) relative to native pointers (generally 64 bits).
The worst-case slowdowns are greatly improved, from 100x (before optimization) to 2x (after optimization).
arXiv Detail & Related papers (2023-08-09T17:12:23Z)
- PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR [10.059491353103526]
We propose IntelliGen, a tensor compiler that can generate high-performance code for memory-intensive operators.
IntelliGen considers both computation and data movement optimizations.
We evaluate IntelliGen on NVIDIA GPU, AMD GPU, and Cambricon MLU, showing speedups of up to 1.97x, 2.93x, and 16.91x (1.28x, 1.23x, and 2.31x on average), respectively.
arXiv Detail & Related papers (2023-07-11T03:17:40Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation [100.89770978711464]
We present SegNeXt, a simple convolutional network architecture for semantic segmentation.
We show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers.
arXiv Detail & Related papers (2022-09-18T14:33:49Z)
- Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow compute onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.