BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
- URL: http://arxiv.org/abs/2503.13795v1
- Date: Tue, 18 Mar 2025 00:52:12 GMT
- Title: BurTorch: Revisiting Training from First Principles by Coupling Autodiff, Math Optimization, and Systems
- Authors: Konstantin Burlachenko, Peter Richtárik
- Abstract summary: BurTorch is a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations. BurTorch adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is due to feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory up to $\times 80$ compared to PyTorch.
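To ground the backpropagation idea the abstract refers to, here is a minimal scalar reverse-mode autodiff sketch in Python, in the spirit of Micrograd (Karpathy, 2020), one of the benchmarked baselines. It is purely illustrative and is not BurTorch's API; BurTorch is implemented in a compiled language precisely to avoid the per-node interpreter overhead this style of code incurs.

```python
# Minimal scalar reverse-mode autodiff, Micrograd-style (illustrative only).
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then propagate adjoints in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# f(x, y) = x * y + x;  df/dx = y + 1, df/dy = x
x, y = Value(2.0), Value(3.0)
f = x * y + x
f.backward()
print(x.grad, y.grad)  # 4.0 2.0
```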
Related papers
- Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency [26.173523821684306]
A token-position-aware layer-skipping framework is proposed that efficiently saves 1.5x the operations while maintaining performance. Experiments on large language models with $7\sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline.
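As a rough illustration of position-aware layer skipping (a hypothetical linear schedule; the paper's actual $D^3$ decay policy may differ), later token positions can traverse fewer transformer layers:

```python
import torch

def layers_to_run(position, num_layers, max_position, min_layers):
    # Linearly decay the executed depth with token position (hypothetical schedule).
    frac = 1.0 - position / max(1, max_position)
    return max(min_layers, round(min_layers + frac * (num_layers - min_layers)))

def forward_with_depth_decay(layers, hidden, position, max_position, min_layers=2):
    depth = layers_to_run(position, len(layers), max_position, min_layers)
    for layer in layers[:depth]:      # later positions skip the deepest layers
        hidden = layer(hidden)
    return hidden

layers = [torch.nn.Linear(8, 8) for _ in range(12)]
h = torch.randn(1, 8)
early = forward_with_depth_decay(layers, h, position=0, max_position=1024)     # all 12 layers
late = forward_with_depth_decay(layers, h, position=1024, max_position=1024)   # only 2 layers
```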
arXiv Detail & Related papers (2025-03-11T15:15:54Z)
- Approximate Top-$k$ for Increased Parallelism [1.2557921586915128]
We present an evaluation of bucketed approximate top-$k$ algorithms. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the available parallelism. We also release a fast bucketed top-$k$ implementation for PyTorch.
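The bucketing idea can be sketched in a few lines of PyTorch. This is a simplified version of the general technique; the released implementation is more sophisticated (e.g., keeping several candidates per bucket and merging).

```python
import torch

def bucketed_topk(x, k):
    # Approximate top-k: partition x into k buckets and reduce each
    # independently, exposing k parallel reductions instead of one
    # global sort. Large values landing in the same bucket can shadow
    # each other, which is the accuracy/parallelism trade-off.
    n = x.numel()
    pad = (-n) % k                                   # pad so k divides the length
    x = torch.cat([x, x.new_full((pad,), float("-inf"))])
    buckets = x.view(k, -1)                          # each row is one bucket
    vals, idx = buckets.max(dim=1)                   # independent per-bucket maxima
    flat_idx = idx + torch.arange(k, device=x.device) * buckets.shape[1]
    return vals, flat_idx

vals, idx = bucketed_topk(torch.randn(1000), k=16)
```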
arXiv Detail & Related papers (2024-12-05T17:17:28Z)
- FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training [51.39495282347475]
We introduce $\texttt{FRUGAL}$ ($\textbf{F}$ull-$\textbf{R}$ank $\textbf{U}$pdates with $\textbf{G}$r$\textbf{A}$dient sp$\textbf{L}$itting), a new memory-efficient optimization framework.
Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam.
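A hedged sketch of the gradient-splitting idea, as our own simplification rather than the paper's exact algorithm: keep optimizer state only for a low-rank subspace of the gradient, and apply a state-free update to the residual.

```python
import torch

def gradient_splitting_step(param, proj, state, lr=1e-3):
    # param: (d,) parameter; proj: (d, r) orthonormal basis of the
    # stateful subspace (e.g., chosen by a GaLore-style selector);
    # state: {"m": (r,), "v": (r,)} Adam moments for the subspace only.
    g = param.grad
    coords = proj.T @ g                      # stateful, low-rank component
    g_residual = g - proj @ coords           # state-free, full-rank component

    m, v = state["m"], state["v"]
    m.mul_(0.9).add_(coords, alpha=0.1)                   # Adam-like moments,
    v.mul_(0.999).addcmul_(coords, coords, value=0.001)   # kept in r dims only
    low_rank_update = proj @ (m / (v.sqrt() + 1e-8))

    # Plain SGD on the residual needs no optimizer state at all.
    param.data.add_(low_rank_update + g_residual, alpha=-lr)
```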
arXiv Detail & Related papers (2024-11-12T14:41:07Z)
- Bundle Adjustment in the Eager Mode [14.13835018035969]
We introduce an eager-mode bundle adjustment framework seamlessly integrated with PyPose.
Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers.
Our approach demonstrates substantial efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ compared to GTSAM, g$^2$o, and Ceres, respectively.
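For intuition, a single dense, damped Gauss-Newton step, a toy analogue of the 2nd-order machinery without the sparse operations and Lie-group support the framework provides, might look like:

```python
import torch

def gauss_newton_step(residual_fn, x, damping=1e-6):
    # One damped Gauss-Newton step for min_x ||r(x)||^2.
    r = residual_fn(x)                                       # (m,) residuals
    J = torch.autograd.functional.jacobian(residual_fn, x)   # (m, n) Jacobian
    H = J.T @ J + damping * torch.eye(x.numel())             # damped normal equations
    g = J.T @ r
    return x - torch.linalg.solve(H, g)

# Toy usage: minimize ||A x - b||^2, solved exactly in one step.
A, b = torch.randn(10, 3), torch.randn(10)
x = gauss_newton_step(lambda x: A @ x - b, torch.zeros(3))
```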
arXiv Detail & Related papers (2024-09-18T17:59:29Z)
- Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) introduces significant hardware-performance overheads.
We propose a novel parallel prompt decoding scheme that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z)
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models on environments distinct from their native development settings.
This led to the introduction of interchange formats such as ONNX, together with runtime infrastructures such as ONNX Runtime, which serve as standard formats for deployment.
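For context, the interchange workflow such a study benchmarks looks roughly like this minimal example, using the standard torch.onnx and onnxruntime APIs:

```python
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2).eval()
dummy = torch.randn(1, 4)

# Export from the native development framework to the ONNX format...
torch.onnx.export(model, dummy, "model.onnx")

# ...then run the same model on a different runtime infrastructure.
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: dummy.numpy()})
```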
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
- PockEngine: Sparse and Efficient Fine-tuning in a Pocket [62.955793932377524]
We introduce PockEngine: a tiny, sparse and efficient engine to enable fine-tuning on various edge devices.
PockEngine supports sparse backpropagation and sparsely updates the model with measured memory saving and latency reduction.
Remarkably, PockEngine enables fine-tuning LLaMav2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9$\times$ faster than PyTorch.
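The general sparse-update idea can be imitated in eager PyTorch by freezing most parameters; PockEngine itself goes further, pruning the backward graph ahead of time under measured cost models. A minimal sketch:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
)

# Freeze most parameters so their gradients are never computed or
# stored, and fine-tune only a small subset (here, the last layer).
for p in model.parameters():
    p.requires_grad_(False)
for p in model[-1].parameters():
    p.requires_grad_(True)

opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-2)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()                  # backprop stops at the frozen layers
opt.step()
```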
arXiv Detail & Related papers (2023-10-26T19:46:11Z)
- MAP: Memory-aware Automated Intra-op Parallel Training For Foundation Models [15.256207550970501]
We introduce MAP, a compiler built upon PyTorch to implement Memory-aware Automated Parallelization.
Compared with existing methods, MAP provides an easy-to-use symbolic profiler to generate memory and computing statistics of an arbitrary PyTorch model.
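A runtime approximation of such per-layer statistics can be gathered with forward hooks; note that, unlike MAP's symbolic profiler, this sketch actually executes the model.

```python
import torch

def activation_memory_stats(model, example):
    # Record the activation size (in bytes) produced by each module.
    stats, handles = {}, []
    def hook(mod, inp, out):
        if torch.is_tensor(out):
            stats[mod] = out.numel() * out.element_size()
    for m in model.modules():
        handles.append(m.register_forward_hook(hook))
    model(example)
    for h in handles:
        h.remove()
    return stats

model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU())
print(activation_memory_stats(model, torch.randn(4, 8)))
```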
arXiv Detail & Related papers (2023-02-06T07:22:49Z)
- LoopStack: a Lightweight Tensor Algebra Compiler Stack [61.04098601022665]
LoopStack is a domain-specific compiler stack for tensor operations.
It generates machine code that matches, and frequently exceeds, the performance of state-of-the-art machine learning frameworks.
It has a very small memory footprint: a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.
arXiv Detail & Related papers (2022-05-02T01:57:58Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
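The flavor of the construction: products of block-diagonal factors interleaved with a fixed permutation, which need sub-quadratic parameters to represent an n-by-n map. A hedged sketch of a Monarch-like multiply, not the paper's exact parameterization:

```python
import torch

def monarch_like_matmul(x, L, R):
    # x: (batch, n) with n = b * b; L, R: (b, b, b) stacks of b x b blocks.
    batch, n = x.shape
    b = L.shape[0]
    y = x.view(batch, b, b)
    y = torch.einsum("aki,kij->akj", y, R)   # block-diagonal factor R
    y = y.transpose(1, 2)                    # fixed permutation
    y = torch.einsum("aki,kij->akj", y, L)   # block-diagonal factor L
    y = y.transpose(1, 2)
    return y.reshape(batch, n)

b = 8; n = b * b
x = torch.randn(2, n)
L, R = torch.randn(b, b, b), torch.randn(b, b, b)
out = monarch_like_matmul(x, L, R)  # 2*b**3 = 1024 parameters vs n**2 = 4096 dense
```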
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- Memory Optimization for Deep Networks [10.519610439720909]
We present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks.
MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation.
For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks.
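For reference, the manual checkpointing baseline that such frameworks automate and improve upon can be written with PyTorch's built-in utility:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(32, 256, requires_grad=True)

# Keep activations only at 4 segment boundaries; everything else is
# recomputed during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```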
arXiv Detail & Related papers (2020-10-27T17:57:34Z)