Related papers: Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields

URL: http://arxiv.org/abs/2510.23621v1
Date: Thu, 23 Oct 2025 14:02:34 GMT
Title: Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields
Authors: Alexandre Benoit
Abstract summary: Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. This thesis aims to make MACE cheaper and faster by identifying computational bottlenecks and evaluating low-precision execution policies.
Score: 51.95157731126864
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. For SO(3)-equivariant models such as MACE, there is little systematic evidence on whether reduced-precision arithmetic and GPU-optimized kernels can cut this cost without harming physical fidelity. This thesis aims to make MACE cheaper and faster while preserving accuracy by identifying computational bottlenecks and evaluating low-precision execution policies. We profile MACE end-to-end and per block, compare the e3nn and NVIDIA cuEquivariance backends, and assess FP64/FP32/BF16/FP16 settings (with FP32 accumulation) for inference, short NVT and long NPT water simulations, and toy training runs under reproducible, steady-state timing. cuEquivariance reduces inference latency by about $3\times$. Casting only linear layers to BF16/FP16 within an FP32 model yields roughly 4x additional speedups, while energies and thermodynamic observables in NVT/NPT MD remain within run-to-run variability. Half-precision weights during training degrade force RMSE. Mixing e3nn and cuEq modules without explicit adapters causes representation mismatches. Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. A practical policy is to use cuEquivariance with FP32 by default and enable BF16/FP16 for linear layers (keeping FP32 accumulations) for maximum throughput, while training remains in FP32. Further gains are expected on Ampere/Hopper GPUs (TF32/BF16) and from kernel-level FP16/BF16 paths and pipeline fusion.
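
The policy in the abstract (BF16 linear layers with FP32 accumulation) can be illustrated with a minimal, standard-library-only sketch. It is not MACE, e3nn, or cuEquivariance code: the helper names `to_bf16`, `to_fp32`, and `linear_bf16_fp32acc` are hypothetical, and the rounding is emulated in software for finite values only, on the assumption that round-to-nearest-even is used when casting to BF16.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate a cast to bfloat16 (finite inputs only, round-to-nearest-even).

    bfloat16 keeps the top 16 bits of the float32 bit pattern, so we add a
    rounding bias and then clear the low 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def linear_bf16_fp32acc(weight, bias, x):
    """Hypothetical linear layer y = W x + b with BF16 inputs/weights.

    Each weight and activation is rounded to bfloat16, while the running
    sum is kept in float32 (the "FP32 accumulation" part of the policy).
    """
    out = []
    for row, b in zip(weight, bias):
        acc = 0.0
        for w, xi in zip(row, x):
            acc = to_fp32(acc + to_bf16(w) * to_bf16(xi))
        out.append(to_fp32(acc + b))
    return out


if __name__ == "__main__":
    W = [[0.1, 0.2], [0.3, 0.4]]
    b = [0.0, 0.0]
    x = [1.0, 2.0]
    print("BF16 inputs, FP32 accumulation:", linear_bf16_fp32acc(W, b, x))
    print("Double-precision reference:    ", [0.1 + 0.4, 0.3 + 0.8])
```

Comparing the two printed rows gives a feel for the size of the rounding error introduced by the BF16 cast alone. Whether such an error is acceptable for a given force field is an empirical question of the kind the thesis addresses with its NVT/NPT simulations.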

Related papers

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling [13.357423392911036]
We introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy.
arXiv Detail & Related papers (2025-12-01T18:59:45Z)
MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE). We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
arXiv Detail & Related papers (2025-11-17T06:17:08Z)
Defeating the Training-Inference Mismatch via FP16 [72.25890308541334]
Reinforcement learning (RL) fine-tuning often suffers from instability due to the numerical mismatch between the training and inference policies. We show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference.
arXiv Detail & Related papers (2025-10-30T17:58:11Z)
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats [51.72056104795248]
Modern AI hardware, such as Nvidia's Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats. This paper systematically investigates the trade-offs between FP and integer (INT) formats. We reveal a critical performance crossover: while FP excels in coarse-grained quantization, the comparison at fine-grained (block-wise) levels is more nuanced.
arXiv Detail & Related papers (2025-10-29T15:11:53Z)
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization [77.67818998672516]
We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization. We introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm. We show that MR-GPTQ matches or outperforms state-of-the-art accuracy.
arXiv Detail & Related papers (2025-09-27T09:22:21Z)
Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference [31.2331188304598]
Changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32.
arXiv Detail & Related papers (2025-06-11T08:23:53Z)
Training LLMs with MXFP4 [16.524414449291488]
We present the first near-lossless training recipe that uses MXFP4 GEMMs, which are $2\times$ faster than FP8 on supported hardware. Our recipe computes $>1/2$ the training FLOPs in MXFP4, enabling an estimated speedup of $>1.3\times$ over FP8 and $>1.7\times$ over BF16 during backpropagation.
arXiv Detail & Related papers (2025-02-27T23:01:31Z)
To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability [7.115739465137031]
BrainFloat16 (BF16) precision has become the de facto standard for large language model pretraining. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8 can be a cost-effective option for LLM training. We propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models.
arXiv Detail & Related papers (2024-05-29T02:42:23Z)
FP8-LM: Training FP8 Large Language Models [47.17804713425323]
In this paper, we propose a new FP8 automatic mixed-precision framework for training large language models. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework.
arXiv Detail & Related papers (2023-10-27T17:59:51Z)
Data-Free Dynamic Compression of CNNs for Tractable Efficiency [46.498278084317704]
Structured pruning approaches have shown promise in lowering floating-point operations without substantial drops in accuracy. We propose HASTE (Hashing for Tractable Efficiency), a data-free, plug-and-play convolution module that instantly reduces a network's test-time inference cost without training or fine-tuning. We demonstrate our approach on the popular vision benchmarks CIFAR-10 and ImageNet, where we achieve a 46.72% reduction in FLOPs with only a 1.25% loss in accuracy.
arXiv Detail & Related papers (2023-09-29T13:09:40Z)
FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [1.1292678337479967]
fbgemm is a high-performance kernel library for high-performance quantized inference on current generation CPUs. fbgemm achieves efficiency by fusing common quantization operations with a high-performance gemm implementation and by shape- and size-specific kernel code generation at runtime. The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
arXiv Detail & Related papers (2021-01-13T00:34:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.