ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces
- URL: http://arxiv.org/abs/2510.11168v1
- Date: Mon, 13 Oct 2025 08:59:13 GMT
- Title: ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces
- Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar
- Abstract summary: We propose a low-precision training framework for extreme multilabel classification (XMC). Low-precision training, combined with the proposed memory optimizations, enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method.
- Score: 13.242009624334996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large output spaces, also referred to as extreme multilabel classification (XMC), arise, e.g., in large-scale tagging and product-to-product recommendation, and are characterized by label counts ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver of compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable as well as inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations -- gradient fusion and chunking -- enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee, without compromising accuracy.
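The two numeric ingredients named in the abstract, stochastic rounding and Kahan-style compensated summation, can be illustrated with a toy NumPy sketch. The function names, the coarse weight grid standing in for Float8, and the way the two ideas are combined below are illustrative assumptions, not the authors' implementation; gradient fusion and chunking are not shown.

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step`, choosing the rounding direction at
    random so the result is unbiased in expectation."""
    scaled = x / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # distance to the upper grid point
    return (lower + (np.random.rand(*np.shape(x)) < prob_up)) * step

def compensated_sgd_step(w, comp, grad, lr, step):
    """One SGD update kept entirely on a coarse weight grid (a stand-in for
    Float8). The `comp` buffer carries the rounding error of the previous
    step into the next one (Kahan-style compensation), so tiny updates are
    not silently dropped."""
    target = w - lr * grad + comp                 # intended full-precision result
    w_new = stochastic_round(target, step)        # snap back onto the low-precision grid
    comp_new = target - w_new                     # rounding error left behind
    return w_new, comp_new

# Toy usage on a 4-dimensional "classifier row"; grid step 1/16 mimics a narrow format.
w, comp = np.zeros(4), np.zeros(4)
for _ in range(1000):
    grad = np.random.randn(4) * 0.01
    w, comp = compensated_sgd_step(w, comp, grad, lr=0.1, step=1.0 / 16)
```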
Related papers
- SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models [4.269807933198402]
Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models.
arXiv Detail & Related papers (2026-02-01T05:24:19Z)
- ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
The Error-Compensating (ECO) approach eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
arXiv Detail & Related papers (2026-01-29T18:35:01Z)
- MPX: Mixed Precision Training for JAX [54.62458721568289]
Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions.
arXiv Detail & Related papers (2025-07-04T05:47:04Z)
- DataDecide: How to Predict Best Pretraining Data with Small Experiments [67.95896457895404]
We release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds.
arXiv Detail & Related papers (2025-04-15T17:02:15Z)
- Direct Quantized Training of Language Models with Stochastic Rounding [8.442358264368693]
Experimental results on LLaMA-structured models of various sizes indicate that training with low-precision weights is feasible even when constrained to ternary values. Our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments.
arXiv Detail & Related papers (2024-12-06T05:41:11Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
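GaLore's core idea, keeping optimizer state in a low-rank subspace of the gradient, can be sketched as follows. The function name, the rank, the refresh interval, and the plain momentum optimizer are illustrative choices for this sketch, not the paper's exact recipe.

```python
import numpy as np

def galore_like_step(W, G, state, rank=4, lr=1e-3, beta=0.9, refresh_every=200):
    """Illustrative low-rank gradient projection step: optimizer statistics live
    in an r-dimensional subspace, so an m x n weight needs roughly O((m + n) * r)
    extra memory instead of O(m * n)."""
    step = state.get("step", 0)
    if "P" not in state or step % refresh_every == 0:
        # Refresh the projector from the top-r left singular vectors of the gradient.
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]                               # m x r projector
        state["M"] = np.zeros((rank, G.shape[1]))              # momentum in the subspace
    P = state["P"]
    state["M"] = beta * state["M"] + (1 - beta) * (P.T @ G)    # r x n projected gradient
    state["step"] = step + 1
    return W - lr * (P @ state["M"])                           # map the update back to m x n

# Toy usage with a hypothetical 64 x 32 weight matrix.
state = {}
W = np.random.randn(64, 32)
G = np.random.randn(64, 32)
W = galore_like_step(W, G, state)
```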
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- Dual-Encoders for Extreme Multi-Label Classification [19.312120188406514]
We show that Dual-encoder (DE) models fall significantly short on extreme multi-label classification (XMC) benchmarks.
We propose a simple modification to the InfoNCE loss that overcomes the limitations of existing contrastive losses.
When trained with our proposed loss functions, standard DE models alone can match or outperform SOTA methods by up to 2% at Precision@1.
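For orientation, a baseline in-batch InfoNCE loss for a dual encoder looks roughly like the sketch below; the paper's proposed modification of this loss is not reproduced here, and the function and argument names are illustrative.

```python
import numpy as np

def infonce_loss(q, d, temperature=0.05):
    """Standard in-batch InfoNCE for a dual encoder: row i of q should score
    highest against row i of d, with the other rows of d acting as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                    # batch x batch similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # cross-entropy on the diagonal

# Toy usage with random 64-dimensional query and document embeddings.
q, d = np.random.randn(16, 64), np.random.randn(16, 64)
print(infonce_loss(q, d))
```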
arXiv Detail & Related papers (2023-10-16T17:55:43Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in place, thereby fine-tuning LMs with the same memory footprint as inference.
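The forward-pass-only idea can be sketched with a simple two-sided (SPSA-style) gradient estimator; the function and parameter names below are illustrative, and the real method applies the perturbation in place over model parameters rather than to a NumPy vector.

```python
import numpy as np

def mezo_like_step(params, loss_fn, lr=1e-3, eps=1e-3, seed=0):
    """Illustrative zeroth-order step: the gradient is estimated from two
    forward passes at +/- eps perturbations, and the perturbation is
    regenerated from a seed so it never has to be stored."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)
    loss_plus = loss_fn(params + eps * z)
    loss_minus = loss_fn(params - eps * z)
    projected_grad = (loss_plus - loss_minus) / (2 * eps)   # scalar directional derivative
    return params - lr * projected_grad * z

# Toy usage with a quadratic "loss" standing in for a forward pass.
theta = np.ones(10)
for step in range(100):
    theta = mezo_like_step(theta, lambda p: float(np.sum(p ** 2)), seed=step)
```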
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
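Plain column-row sampling, the building block that WTA-CRS refines, can be illustrated as below. The deterministic "winner-take-all" selection of the heaviest column-row pairs is omitted, and the function name is an assumption for this sketch.

```python
import numpy as np

def cr_sampled_matmul(A, B, k, rng=None):
    """Unbiased column-row sampled estimate of A @ B: sample k column-row
    pairs with probability proportional to their norms, and rescale each
    sampled outer product by 1 / (k * p_i) so the estimator stays unbiased."""
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=k, p=p, replace=True)
    return sum(np.outer(A[:, i], B[i, :]) / (k * p[i]) for i in idx)

# Toy usage: compare the sampled estimate against the exact product.
A, B = np.random.randn(8, 100), np.random.randn(100, 6)
approx = cr_sampled_matmul(A, B, k=30)
exact = A @ B
```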
arXiv Detail & Related papers (2023-05-24T15:52:08Z)