On Efficient Constructions of Checkpoints
- URL: http://arxiv.org/abs/2009.13003v1
- Date: Mon, 28 Sep 2020 01:20:15 GMT
- Title: On Efficient Constructions of Checkpoints
- Authors: Yu Chen, Zhenming Liu, Bin Ren, Xin Jin
- Abstract summary: We propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint).
LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed.
Our experiments show that LC-Checkpoint achieves a compression rate up to $28\times$ and recovery speedup up to $5.77\times$ over a state-of-the-art algorithm (SCAR).
- Score: 21.965296582303115
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Efficient construction of checkpoints/snapshots is a critical tool for
training and diagnosing deep learning models. In this paper, we propose a lossy
compression scheme for checkpoint constructions (called LC-Checkpoint).
LC-Checkpoint simultaneously maximizes the compression rate and optimizes the
recovery speed, under the assumption that SGD is used to train the model.
LC-Checkpoint uses quantization and priority promotion to store the most crucial
information for SGD to recover, and then uses Huffman coding to leverage the
non-uniform distribution of the gradient scales. Our extensive experiments show
that LC-Checkpoint achieves a compression rate up to $28\times$ and recovery
speedup up to $5.77\times$ over a state-of-the-art algorithm (SCAR).
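As a rough illustration of the recipe the abstract outlines, the Python sketch below compresses the delta between two consecutive checkpoints: a top-k magnitude filter stands in for priority promotion, the kept values are quantized to sign/exponent buckets, and the bucket symbols are Huffman-coded. The keep fraction, bucket range, and symbol format are illustrative assumptions, not the paper's exact construction.

```python
import heapq
from collections import Counter

import numpy as np


def huffman_code_lengths(symbols):
    """Return a {symbol: code length in bits} map for a symbol stream."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, depths1 = heapq.heappop(heap)
        f2, _, depths2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (f1 + f2, uid, merged))
        uid += 1
    return heap[0][2]


def compress_checkpoint_delta(prev, curr, keep_frac=0.1, min_exp=-16):
    """Sketch: compress the change between two consecutive checkpoints."""
    delta = (curr - prev).ravel()
    k = max(1, int(keep_frac * delta.size))
    idx = np.argpartition(np.abs(delta), -k)[-k:]     # top-k by magnitude
    kept = delta[idx]
    exps = np.floor(np.log2(np.abs(kept) + 1e-30)).astype(int)
    exps = np.clip(exps, min_exp, 0)                  # power-of-two exponent buckets
    symbols = list(zip(np.sign(kept).astype(int).tolist(), exps.tolist()))
    lengths = huffman_code_lengths(symbols)
    payload_bits = sum(lengths[s] for s in symbols)
    return idx, symbols, payload_bits                 # indices + coded buckets
```

Recovery would add the dequantized bucket values back onto the previous checkpoint at the stored indices, which is why retaining the entries most important to SGD matters.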
Related papers
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
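A minimal sketch of the dimension-reduction idea in the summary above: keys and values are projected into a smaller latent space before caching and expanded only when attention is computed. The projection layers and sizes are illustrative assumptions; the paper's frequency-aware rotary embedding is not modeled.

```python
import torch


class LatentKVCache:
    """Store keys/values in a smaller latent dimension to shrink the KV cache."""

    def __init__(self, d_model=1024, d_latent=256):
        self.down_k = torch.nn.Linear(d_model, d_latent, bias=False)
        self.down_v = torch.nn.Linear(d_model, d_latent, bias=False)
        self.up_k = torch.nn.Linear(d_latent, d_model, bias=False)
        self.up_v = torch.nn.Linear(d_latent, d_model, bias=False)
        self.ks, self.vs = [], []

    def append(self, k, v):
        # only the latent vectors are kept, reducing memory by d_latent/d_model
        self.ks.append(self.down_k(k))
        self.vs.append(self.down_v(v))

    def materialize(self):
        # expand back to model width for the attention computation
        k = self.up_k(torch.stack(self.ks, dim=1))
        v = self.up_v(torch.stack(self.vs, dim=1))
        return k, v
```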
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling [1.7495213911983414]
We propose a prediction-based compression approach, where values from the previously saved checkpoint are used for context modeling in arithmetic coding. Experimental results show that our approach achieves substantial bit size reduction, while enabling near-lossless training recovery from restored checkpoints.
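A rough sketch of the prediction step, using the previously saved checkpoint as the predictor and an ideal-entropy estimate in place of the arithmetic coder; the residual binning is an assumption.

```python
import numpy as np


def predicted_residual_bits(prev_ckpt, curr_ckpt, n_bins=256):
    """Predict each weight from the previous checkpoint and estimate how many
    bits an ideal entropy coder would need for the residuals."""
    residual = curr_ckpt.ravel() - prev_ckpt.ravel()   # prediction = old value
    hist, _ = np.histogram(residual, bins=n_bins)
    p = hist[hist > 0] / residual.size
    bits_per_weight = float(-(p * np.log2(p)).sum())
    return bits_per_weight * residual.size
```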
arXiv Detail & Related papers (2025-06-13T17:54:42Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - On the Convergence of DP-SGD with Adaptive Clipping [56.24689348875711]
Stochastic gradient descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization.
This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD).
We show how QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but can be mitigated through a carefully designed quantile and step size schedule.
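A hedged sketch of one quantile-clipped step in the DP-SGD setting: the clipping threshold is the q-th quantile of the batch's per-sample gradient norms rather than a fixed constant. The quantile value, noise scaling, and step size are illustrative assumptions, not the paper's schedule.

```python
import torch


def qc_sgd_step(param, per_sample_grads, lr=0.1, q=0.9, noise_mult=1.0):
    """One SGD step with an adaptive, quantile-based clipping threshold."""
    norms = torch.stack([g.norm() for g in per_sample_grads])
    c = torch.quantile(norms, q)                        # adaptive clipping threshold
    clipped = [g * torch.clamp(c / (n + 1e-12), max=1.0)
               for g, n in zip(per_sample_grads, norms)]
    mean_grad = torch.stack(clipped).mean(dim=0)
    noise = noise_mult * c / len(per_sample_grads) * torch.randn_like(mean_grad)
    return param - lr * (mean_grad + noise)
```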
arXiv Detail & Related papers (2024-12-27T20:29:47Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate its memory footprint include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
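A small sketch of the plug-in low-rank idea: factor a pretrained key/value projection matrix with a truncated SVD so the factors can replace it without retraining. The rank choice and the paper's progressive, per-layer strategy are not modeled.

```python
import torch


def low_rank_factorize(weight, rank):
    """Approximate a KV projection matrix by a rank-r factorization."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out_dim, rank)
    B = Vh[:rank, :]             # (rank, in_dim)
    return A, B                  # weight is approximately A @ B

# usage sketch: swap each attention layer's W_k / W_v for its factors, then cache
# the rank-r intermediate activations instead of full-width keys and values.
```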
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development [9.13331802151585]
ByteCheckpoint is an industrial-grade checkpointing system for large-scale LFM training.
ByteCheckpoint significantly reduces checkpoint stalls, achieving an average reduction of 54.20x.
For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96x and 8.80x, respectively.
arXiv Detail & Related papers (2024-07-29T16:18:20Z) - FastPersist: Accelerating Model Checkpointing in Deep Learning [21.308403847800573]
We propose FastPersist to accelerate checkpoint creation in Deep Learning (DL) training.
FastPersist combines three novel techniques: (i) optimizations for faster checkpoint writes to persistent storage, (ii) efficient write parallelism across the storage devices available in training environments, and (iii) overlapping checkpointing with independent training computations.
Our evaluation shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
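The sketch below illustrates only the overlap idea from point (iii): snapshot tensors to host memory, then let a background thread persist them while training continues. The write-parallelism optimizations are not shown, and the helper name is an assumption.

```python
import threading

import torch


def checkpoint_async(state_dict, path):
    """Snapshot tensors to CPU, then persist them on a background thread."""
    cpu_state = {k: v.detach().to("cpu", copy=True) for k, v in state_dict.items()}
    writer = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    writer.start()
    return writer  # call writer.join() before reusing or overwriting the file


# usage sketch
# writer = checkpoint_async(model.state_dict(), "ckpt_step_1000.pt")
# ...run the next training iterations...
# writer.join()
```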
arXiv Detail & Related papers (2024-06-19T18:31:23Z) - ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking [39.02269424136506]
Large language models (LLMs) have recently attracted significant attention in the field of artificial intelligence.
We propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints.
We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance.
arXiv Detail & Related papers (2024-06-17T06:47:29Z) - Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better [31.67038902035949]
Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks.
In this work, we find that high-quality model weights often lie in a basin which cannot be reached by SGD but can be obtained by proper checkpoint averaging.
We propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search.
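A minimal sketch of the checkpoint-combination step; LCSC obtains the coefficients via evolutionary search, whereas here they are simply supplied, and the normalization is an assumption.

```python
import torch


def combine_checkpoints(state_dicts, coeffs):
    """Weighted combination of checkpoints saved along a training trajectory."""
    weights = torch.tensor(coeffs, dtype=torch.float32)
    weights = weights / weights.sum()     # assume coefficients are normalized
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}
```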
arXiv Detail & Related papers (2024-04-02T18:59:39Z) - GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding [81.01996600734616]
We introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding.
GliDe is a modified draft model architecture that reuses the cached keys and values from the target LLM.
We will release our code, data, and the trained draft models.
arXiv Detail & Related papers (2024-02-03T08:44:11Z) - RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End
Robust Estimation [74.47709320443998]
We propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation.
RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set.
Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis.
arXiv Detail & Related papers (2023-08-10T03:14:19Z) - Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization [5.648270790530862]
State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality (accuracy) and compression ratio.
We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels.
We propose a non-uniform quantization scheme that leverages this variation, an efficient search mechanism that dynamically finds the best quantization configurations, and a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences.
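A hedged sketch of sensitivity-driven, non-uniform quantization: layers judged more sensitive receive more bits. The sensitivity scores and fixed bit budgets are illustrative assumptions; the paper searches configurations dynamically during training.

```python
import numpy as np


def quantize_group(w, bits):
    """Uniform quantization of one weight group to a given bit width."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint16)
    return q, lo, scale


def quantize_checkpoint(weights, sensitivity, budgets=(2, 4, 8)):
    """Assign more bits to layers whose scores mark them as more sensitive."""
    order = sorted(weights, key=lambda name: sensitivity[name])  # least sensitive first
    out = {}
    for rank, name in enumerate(order):
        bits = budgets[min(rank * len(budgets) // len(order), len(budgets) - 1)]
        out[name] = (bits,) + quantize_group(weights[name], bits)
    return out
```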
arXiv Detail & Related papers (2023-06-20T18:00:31Z) - Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent with Learned Distance Functions [77.32043242988738]
We propose a new framework for accurate point cloud upsampling that supports arbitrary upsampling rates.
Our method first interpolates the low-res point cloud according to a given upsampling rate.
arXiv Detail & Related papers (2023-04-24T06:36:35Z) - GPCO: An Unsupervised Green Point Cloud Odometry Method [64.86292006892093]
A lightweight point cloud odometry solution is proposed and named the green point cloud odometry (GPCO) method.
GPCO is an unsupervised learning method that predicts object motion by matching features of consecutive point cloud scans.
It is observed that GPCO outperforms benchmarking deep learning methods in accuracy while it has a significantly smaller model size and less training time.
arXiv Detail & Related papers (2021-12-08T00:24:03Z) - Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned to various depths on demand, improving the real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)