Related papers: Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study

Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study

URL: http://arxiv.org/abs/2509.12229v1
Date: Sun, 07 Sep 2025 21:41:14 GMT
Title: Profiling LoRA/QLoRA Fine-Tuning Efficiency on Consumer GPUs: An RTX 4060 Case Study
Authors: MSR Avinash,
Abstract summary: Fine-tuning large language models with parameter-efficient techniques such as LoRA and QLoRA has enabled adaptation of modest hardware.<n>We present a controlled profiling study of LoRA/QLoRA fine-tuning using the Qwen2.5-1.5B-Instruct model on a single NVIDIA GTX 4060.<n>Results show that pageds improve throughput by up to 25% (628 tok/s vs. 500 tok/s baseline), while bf16 degrades efficiency relative to fp16.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fine-tuning large language models (LLMs) with parameter-efficient techniques such as LoRA and QLoRA has enabled adaptation of foundation models on modest hardware. Yet the efficiency of such training on consumer-grade GPUs, especially under strict 8 GB VRAM limits, remains underexplored. We present a controlled profiling study of LoRA/QLoRA fine-tuning using the Qwen2.5-1.5B-Instruct model on a single NVIDIA RTX 4060. Across three representative configurations, we systematically vary batch size, sequence length, optimizer choice (AdamW vs. PagedAdamW), and precision (fp16 vs. bf16). We report throughput (tokens/s), time per 10k tokens, and VRAM footprint, alongside energy estimates derived from GPU board power limits. Our results show that paged optimizers improve throughput by up to 25% (628 tok/s vs. 500 tok/s baseline), while bf16 degrades efficiency relative to fp16. Despite 8 GB constraints, sequence lengths up to 2048 tokens were feasible using parameter-efficient strategies. To our knowledge, this is the first systematic case study of LLM fine- tuning efficiency on consumer GPUs, providing reproducible benchmarks and practical guidelines for resource-constrained researchers and practitioners.

Related papers

Karhunen-Loève Expansion-Based Residual Anomaly Map for Resource-Efficient Glioma MRI Segmentation [0.0]
Deep learning is currently the state-of-the-art for brain tumor segmentation.<n>State-of-the-art models use thousands of training cases and vast computational power, where performance drops sharply when either is limited.<n>This study implements a feature extraction step on downsampled, z-score normalized MRI volumes.<n>It achieves post-processed Dice scores of 0.929 (WT), 0.856 (TC), and 0.821 (ET), with HD95 distances of 2.93, 6.78, and 10.35 voxels.
arXiv Detail & Related papers (2026-01-16T23:48:40Z)
Performance Trade-offs of Optimizing Small Language Models for E-Commerce [1.0312968200748118]
Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks.<n>This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative.
arXiv Detail & Related papers (2025-10-24T18:49:28Z)
LoRA Fine-Tuning Without GPUs: A CPU-Efficient Meta-Generation Framework for LLMs [8.397730500554047]
Low-Rank Adapters (LoRAs) have transformed the fine-tuning of Large Language Models (LLMs) by enabling parameter-efficient updates.<n>We propose a theoretically grounded approach to LoRA fine-tuning designed specifically for users with limited computational resources.
arXiv Detail & Related papers (2025-07-02T15:24:47Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU.<n>We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z)
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
JORA: JAX Tensor-Parallel LoRA Library for Retrieval Augmented Fine-Tuning [16.86356520836045]
We introduce a novel framework for PEFT-compatible fine-tuning of Llama-2 models, leveraging distributed training. Our framework uniquely utilizes JAX's just-in-time (JIT) compilation and tensor-sharding for efficient resource management. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPU while consuming less than half the VRAM per GPU.
arXiv Detail & Related papers (2024-03-17T23:02:04Z)
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and GPU states. In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy. Our 8-bit GaLore further reduces memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs [5.735411578779657]
Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is commonly used to adapt a base LLM to multiple downstream tasks. LoRA platforms enable developers to fine-tune multiple models and develop various domain-specific applications simultaneously. Existing model parallelism schemes suffer from high communication overhead and inefficient GPU utilization when training multiple LoRA tasks.
arXiv Detail & Related papers (2023-12-05T05:38:38Z)
Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU. QLoRA backpropagates through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters(LoRA) Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
arXiv Detail & Related papers (2023-05-23T17:50:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.