LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
- URL: http://arxiv.org/abs/2512.15306v1
- Date: Wed, 17 Dec 2025 10:51:45 GMT
- Title: LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
- Authors: Erik Schultheis, Dan Alistarh
- Abstract summary: We present an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on commodity GPUs. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%.
- Score: 45.51664355320938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
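The abstract names activation checkpointing and offloading as the main memory optimizations. Below is a minimal PyTorch-style sketch of that general checkpoint-and-offload pattern; it is an illustration of the technique, not LLMQ's CUDA/C++ implementation, and all module names, shapes, and hyperparameters are illustrative.

```python
# Minimal sketch of activation checkpointing plus CPU offloading of
# optimizer state -- the generic pattern the abstract refers to, NOT
# LLMQ's CUDA/C++ code. All names and sizes here are illustrative.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(nn.Module):
    def __init__(self, n_layers=8, dim=1024):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(n_layers)])
    def forward(self, x):
        for blk in self.blocks:
            # Recompute each block's activations during backward instead
            # of storing them: trades extra FLOPs for activation memory.
            x = checkpoint(blk, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CheckpointedStack().to(device)
params = list(model.parameters())

# Keep optimizer state (here: SGD momentum) resident in host memory and
# move only per-tensor slices across PCIe when the update is applied.
momentum = [torch.zeros(p.shape, device="cpu") for p in params]
if device == "cuda":
    momentum = [m.pin_memory() for m in momentum]  # pinned => async H2D copies
lr, beta = 1e-3, 0.9

x = torch.randn(2, 128, 1024, device=device)
model(x).square().mean().backward()

with torch.no_grad():
    for p, m in zip(params, momentum):
        g = p.grad.to("cpu")                                 # offload gradient to host
        m.mul_(beta).add_(g)                                 # momentum update on CPU
        p.add_(m.to(device, non_blocking=True), alpha=-lr)   # apply update on GPU
        p.grad = None                                        # free GPU gradient memory
```

The third optimization listed, copy-engine based collectives, is not reproduced here; the general idea is to route parameter and gradient traffic through the GPU's DMA engines so that communication overlaps with compute.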
Related papers
- Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving [2.6336040306318274]
Large Language Model (LLM) adapters enable low-cost model specialization. LLM adapters introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. This paper presents a data-driven pipeline that computes an adapter placement that serves the workload with the minimum number of GPUs.
arXiv Detail & Related papers (2026-02-27T14:22:51Z) - Can Large Language Models Predict Parallel Code Performance? [1.5221392705893568]
This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC roofline analysis and performance portability. (A toy roofline classification example is sketched after this list.)
arXiv Detail & Related papers (2025-05-06T21:41:20Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors [11.127604539303373]
Fine-tuning large language models (LLMs) requires significant memory, often exceeding the capacity of a single GPU. A common solution to this memory challenge is offloading compute and data from the GPU to the CPU. We present an offloading framework, LSP-Offload, that enables near-native speed LLM fine-tuning on commodity hardware.
arXiv Detail & Related papers (2024-06-14T16:59:11Z) - MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter [40.616849959987555]
We introduce a novel mechanism that fine-tunes Large Language Models (LLMs) with larger adapters while remaining memory-efficient.
This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs.
We employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU.
arXiv Detail & Related papers (2024-06-07T14:49:22Z) - GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. (A minimal gradient low-rank projection sketch follows this list.)
arXiv Detail & Related papers (2024-03-06T07:29:57Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability arising from peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update into one step to reduce memory usage. (A minimal fused-update sketch follows this list.)
arXiv Detail & Related papers (2023-06-16T11:37:15Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z)
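As referenced above for the roofline-prediction paper: classifying a kernel as compute- or memory-bound under the Roofline model takes only a few numbers. A minimal sketch, with all hardware figures and kernel statistics purely illustrative:

```python
# Toy Roofline classification: a kernel is memory-bound when its
# arithmetic intensity (FLOPs per byte moved) is below the hardware
# "ridge point" = peak_flops / peak_bandwidth. Numbers are illustrative.

def roofline(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> str:
    intensity = flops / bytes_moved                 # FLOP / byte
    ridge = peak_flops / peak_bw                    # FLOP / byte
    attainable = min(peak_flops, peak_bw * intensity)
    kind = "memory-bound" if intensity < ridge else "compute-bound"
    return f"{kind}, attainable {attainable / 1e12:.1f} TFLOP/s"

# Hypothetical GPU: 80 TFLOP/s peak compute, 1000 GB/s memory bandwidth.
PEAK, BW = 80e12, 1000e9
# GEMM-like kernel: high arithmetic intensity -> compute-bound.
print(roofline(flops=2e12, bytes_moved=4e9, peak_flops=PEAK, peak_bw=BW))
# Elementwise kernel: ~0.25 FLOP/byte -> memory-bound.
print(roofline(flops=1e9, bytes_moved=4e9, peak_flops=PEAK, peak_bw=BW))
```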
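For the GaLore entry: the core idea is to project each weight-matrix gradient onto a low-rank subspace, keep the optimizer state only in that small space, and project the update back. A minimal single-matrix sketch with plain momentum standing in for Adam; the ranks, shapes, and refresh schedule are illustrative, not the paper's settings.

```python
# Toy gradient low-rank projection in the spirit of GaLore: optimizer
# state lives in a rank-r subspace instead of the full weight shape.
# Plain momentum replaces Adam; all shapes and ranks are illustrative.
import torch

torch.manual_seed(0)
m, n, r, lr, beta = 1024, 1024, 32, 1e-2, 0.9
W = torch.randn(m, n) * 0.02           # weight matrix being trained
M = torch.zeros(r, n)                   # momentum in projected space: r x n, not m x n

P = None                                # projection matrix, refreshed periodically
for step in range(200):
    G = torch.randn(m, n)               # stand-in for the true gradient dL/dW

    if step % 50 == 0:                  # periodically re-fit the subspace
        U, S, Vh = torch.linalg.svd(G, full_matrices=False)
        P = U[:, :r]                    # m x r: top-r left singular vectors

    G_low = P.T @ G                     # project gradient down: r x n
    M.mul_(beta).add_(G_low)            # optimizer state stays low-rank
    W -= lr * (P @ M)                   # project the update back to m x n
```

In GaLore proper the projection is refit from actual gradients on a schedule and paired with Adam statistics; the 8-bit variant additionally quantizes those statistics.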
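For the LOMO entry: fusing the gradient computation with the parameter update means each gradient is consumed the moment it is produced and never stored for a separate optimizer pass. A minimal sketch assuming PyTorch >= 2.1's register_post_accumulate_grad_hook; the hook-based SGD update below is an illustration of the idea, not the paper's implementation.

```python
# Toy fused gradient/update step in the spirit of LOMO: apply SGD and
# free each gradient inside the backward pass, so at no point do all
# gradients coexist in memory. Requires torch >= 2.1 for the hook API.
import torch
import torch.nn as nn

lr = 1e-3

def fused_sgd_hook(param: torch.Tensor) -> None:
    # Runs right after this parameter's gradient has been accumulated.
    with torch.no_grad():
        param.add_(param.grad, alpha=-lr)   # update immediately ...
    param.grad = None                       # ... and drop the gradient

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
for p in model.parameters():
    p.register_post_accumulate_grad_hook(fused_sgd_hook)

x = torch.randn(8, 512)
loss = model(x).square().mean()
loss.backward()     # parameters are already updated when this returns
```

The paper's full method adds stabilization details that are omitted from this sketch.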