Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth
- URL: http://arxiv.org/abs/2601.02609v1
- Date: Tue, 06 Jan 2026 00:00:55 GMT
- Title: Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth
- Authors: Arjun S. Nair
- Abstract summary: We present Chronicals, an open-source training framework achieving 3.5x speedup over Unsloth. We provide complete mathematical foundations: online softmax correctness, FlashAttention IO complexity O(N^2 d^2 M^{-1}), LoRA+ learning rate gradient approximations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model fine-tuning is bottlenecked by memory: a 7B parameter model requires 84GB (14GB for weights, 14GB for gradients, and 56GB for FP32 optimizer states), exceeding even A100-40GB capacity. We present Chronicals, an open-source training framework achieving 3.51x speedup over Unsloth through four synergistic optimizations: (1) fused Triton kernels eliminating 75% of memory traffic via RMSNorm (7x), SwiGLU (5x), and QK-RoPE (2.3x) fusion; (2) Cut Cross-Entropy reducing logit memory from 5GB to 135MB through online softmax computation; (3) LoRA+ with theoretically derived 16x differential learning rates between adapter matrices; and (4) Best-Fit Decreasing sequence packing recovering 60-75% of compute wasted on padding. On Qwen2.5-0.5B with A100-40GB, Chronicals achieves 41,184 tokens/second for full fine-tuning versus Unsloth's 11,736 tokens/second (3.51x). For LoRA at rank 32, we reach 11,699 tokens/second versus Unsloth MAX's 2,857 tokens/second (4.10x). Critically, we discovered that Unsloth's reported 46,000 tokens/second benchmark exhibited zero gradient norms, meaning the model was not training. We provide complete mathematical foundations: online softmax correctness proofs, FlashAttention IO complexity bounds O(N^2 d^2 M^{-1}), LoRA+ learning rate derivations from gradient magnitude analysis, and bin-packing approximation guarantees. All implementations, benchmarks, and proofs are available at https://github.com/Ajwebdevs/Chronicals with pip installation via https://pypi.org/project/chronicals/.
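The online softmax computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the Chronicals implementation (which the abstract describes as fused GPU kernels); it is a plain-Python illustration, assuming a non-empty list of logits, of how a log-sum-exp can be accumulated one vocabulary chunk at a time so that the full set of exponentiated logits never has to be held at once. The names `streaming_logsumexp`, `cross_entropy`, and `chunk_size` are hypothetical and chosen only for this example.

```python
import math


def streaming_logsumexp(logits, chunk_size=4):
    """Return log(sum(exp(x))) over `logits`, visiting one chunk at a time.

    Only a running maximum and a running rescaled sum are kept, which is
    the core idea behind online softmax. Assumes `logits` is non-empty.
    """
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(x - running_max) over values seen so far
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old sum to the new maximum, then add this chunk's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in chunk
        )
        running_max = new_max
    return running_max + math.log(running_sum)


def cross_entropy(logits, target, chunk_size=4):
    """Cross-entropy of one token: logsumexp(logits) - logits[target]."""
    return streaming_logsumexp(logits, chunk_size) - logits[target]
```

Because each step rescales by the running maximum, the result matches a direct log-sum-exp up to floating-point rounding, even when some logits are very large.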
Related papers
- Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding [0.0]
We present Nacrith, a compression system that achieves the best compression results among the systems evaluated in this study on natural language text. The system requires only 500 MB of GGUF weights and 1.2 GB VRAM per worker, running on consumer GPU. On alice29 (Canterbury Corpus, 152 KB), Nacrith achieves 0.918 bits per byte (bpb), while compressing below the 0th-, 1st-, and 2nd-order Shannon entropy bounds.
arXiv Detail & Related papers (2026-02-23T09:14:05Z)
- Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs [11.45717904490388]
Recent advances in transformer-based foundation models have made them the default choice for many tasks. Their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices.
arXiv Detail & Related papers (2025-12-24T00:41:13Z)
- SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations [54.303301888915406]
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. We propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching. We also propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels.
arXiv Detail & Related papers (2025-12-16T04:39:10Z)
- RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression [0.0]
RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz). Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0.
arXiv Detail & Related papers (2025-11-23T12:00:33Z)
- ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization [99.96330641363396]
ARMOR (Adaptive Representation with Matrix-factORization) is a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. We demonstrate ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations.
arXiv Detail & Related papers (2025-10-07T02:39:20Z)
- 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [52.079202872069835]
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size. We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
arXiv Detail & Related papers (2025-04-15T22:38:38Z)
- Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme. Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z)
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection [133.45193150403537]
Training Large Language Models (LLMs) presents significant memory challenges due to the growing size of weights and optimizer states.
In this work, we propose Gradient Low-Rank Projection (GaLore) as a memory-efficient training strategy.
Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline.
arXiv Detail & Related papers (2024-03-06T07:29:57Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.