Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training
- URL: http://arxiv.org/abs/2508.00806v2
- Date: Fri, 08 Aug 2025 09:49:52 GMT
- Title: Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training
- Authors: Ping Chen, Zhuohong Deng, Ping Li, Shuibing He, Hongzi Zhu, Yi Zheng, Zhefeng Wang, Baoxing Huai, Minyi Guo
- Abstract summary: Training large language models (LLMs) is often constrained by GPU memory limitations. Adacc is the first adaptive memory optimization framework that unifies activation recomputation and data compression. Adacc improves training throughput by 1.01x to 1.37x compared to state-of-the-art frameworks.
- Score: 40.371351103295765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large language models (LLMs) is often constrained by GPU memory limitations. To alleviate memory pressure, activation recomputation and data compression have been proposed as two major strategies. However, both approaches have limitations: recomputation introduces significant training overhead, while compression can lead to accuracy degradation and computational inefficiency when applied naively. In this paper, we propose Adacc, the first adaptive memory optimization framework that unifies activation recomputation and data compression to improve training efficiency for LLMs while preserving model accuracy. Unlike existing methods that apply static, rule-based strategies or rely solely on one technique, Adacc makes fine-grained, tensor-level decisions, dynamically selecting between recomputation, retention, and compression based on tensor characteristics and runtime hardware constraints. Adacc tackles three key challenges: (1) it introduces layer-specific compression algorithms that mitigate accuracy loss by accounting for outliers in LLM activations; (2) it employs a MILP-based scheduling policy to globally optimize memory strategies across layers; and (3) it integrates an adaptive policy evolution mechanism to update strategies during training in response to changing data distributions. Experimental results show that Adacc improves training throughput by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining accuracy comparable to the baseline.
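The abstract names the MILP-based scheduling policy without spelling it out. As a rough sketch of the idea, the PuLP program below picks exactly one of retain / recompute / compress per activation tensor so as to minimize added compute time under a memory budget; the tensor names, sizes, and cost numbers are invented placeholders, not values from the paper.

```python
# A minimal sketch of the MILP idea behind Adacc's scheduling policy.
# All per-tensor numbers below are illustrative placeholders.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

tensors = {
    # name: (retained size MB, recompute cost ms, compress cost ms, compressed size MB)
    "attn_act": (512, 3.0, 0.8, 128),
    "mlp_act":  (1024, 5.0, 1.2, 256),
    "ln_act":   (64, 0.2, 0.3, 32),
}
MEM_BUDGET_MB = 700
STRATEGIES = ("retain", "recompute", "compress")

prob = LpProblem("adacc_sketch", LpMinimize)
choice = {(t, s): LpVariable(f"{t}_{s}", cat=LpBinary)
          for t in tensors for s in STRATEGIES}

# Exactly one strategy per tensor.
for t in tensors:
    prob += lpSum(choice[t, s] for s in STRATEGIES) == 1

# Objective: total extra time (retaining a tensor costs nothing extra).
prob += lpSum(tensors[t][1] * choice[t, "recompute"]
              + tensors[t][2] * choice[t, "compress"] for t in tensors)

# Memory: retained tensors count fully, compressed ones at reduced size,
# recomputed ones are freed entirely.
prob += lpSum(tensors[t][0] * choice[t, "retain"]
              + tensors[t][3] * choice[t, "compress"] for t in tensors) <= MEM_BUDGET_MB

prob.solve(PULP_CBC_CMD(msg=0))
for t in tensors:
    picked = [s for s in STRATEGIES if choice[t, s].value() == 1]
    print(t, "->", picked[0])
```

In the full system, per the abstract, this optimization would be revisited during training as the adaptive policy evolution mechanism reacts to changing data distributions.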
Related papers
- BitSnap: Checkpoint Sparsification and Quantization in LLM Training [12.420041526429287]
Large language models (LLMs) continue to grow in size and complexity. Efficient checkpoint saving and loading has become crucial for managing storage, memory usage, and fault tolerance in LLM training. This paper proposes a novel checkpoint sparsification and quantization method that adapts dynamically to different training stages and model architectures.
arXiv Detail & Related papers (2025-11-15T22:48:59Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work rethinks that paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- EDGC: Entropy-driven Dynamic Gradient Compression for Efficient LLM Training [11.898292264147427]
Training large language models (LLMs) poses significant challenges regarding computational resources and memory capacity. Existing approaches primarily rely on static gradient compression to enhance communication efficiency. We propose an entropy-driven dynamic gradient compression framework called EDGC.
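How entropy drives the compression rate is not specified in the abstract; one plausible toy reading, sparsify more aggressively when the gradient's magnitude histogram is low-entropy (a few coordinates dominate), is sketched below. The `entropy_to_ratio` mapping and its constants are hypothetical.

```python
import numpy as np

def gradient_entropy(grad: np.ndarray, bins: int = 64) -> float:
    """Empirical entropy (bits) of the gradient-magnitude histogram."""
    hist, _ = np.histogram(np.abs(grad), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_to_ratio(h: float, h_max: float, lo: float = 0.01, hi: float = 0.3) -> float:
    """Hypothetical mapping: low entropy -> few dominant coordinates carry
    the signal, so keep fewer; high entropy -> keep more."""
    return lo + (hi - lo) * min(h / h_max, 1.0)

def topk_compress(grad: np.ndarray, ratio: float):
    k = max(1, int(ratio * grad.size))
    idx = np.argpartition(np.abs(grad).ravel(), -k)[-k:]
    return idx, grad.ravel()[idx]          # indices + values to communicate

grad = np.random.standard_cauchy(10_000)   # heavy-tailed toy gradient
h = gradient_entropy(grad)
ratio = entropy_to_ratio(h, h_max=np.log2(64))
idx, vals = topk_compress(grad, ratio)
print(f"entropy={h:.2f} bits, keep ratio={ratio:.3f}, sent {idx.size} of {grad.size}")
```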
arXiv Detail & Related papers (2025-11-13T14:05:34Z)
- Sig2Model: A Boosting-Driven Model for Updatable Learned Indexes [6.133666849556217]
Sig2Model is an efficient and adaptive learned index that minimizes retraining cost through three key techniques. We show that Sig2Model reduces retraining cost by up to 20x, achieves up to 3x higher QPS, and uses up to 1000x less memory.
arXiv Detail & Related papers (2025-09-25T06:07:13Z)
- CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression [0.0]
Large Language Models (LLMs) present significant deployment challenges due to their immense size and computational requirements. We introduce Corrective Adaptive Low-Rank Decomposition (CALR), a two-component compression approach. We show that CALR can reduce parameter counts by 26.93% to 51.77% while retaining 59.45% to 90.42% of the original model's performance.
arXiv Detail & Related papers (2025-08-21T13:16:02Z)
- Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence [131.41894248194995]
We propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Thanks to this task awareness, our method enables two optional adaptation modes: knowledge-preserved mode (KPM) and instruction-previewed mode (IPM).
arXiv Detail & Related papers (2025-06-16T07:55:14Z)
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
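The abstract gives the method's name but not its mechanics; a toy rendering of a prune-then-merge scheme, keep the top-scoring tokens and fold each pruned token into its most similar kept token, might look like this. The importance scores and the weighted-average merge rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def prune_and_merge(tokens: np.ndarray, scores: np.ndarray, keep: int) -> np.ndarray:
    """tokens: (N, D) token embeddings; scores: (N,) importance scores.
    Keeps the `keep` highest-scoring tokens and merges each pruned token
    into its most similar kept token (running weighted average), so content
    is folded in rather than discarded outright."""
    order = np.argsort(scores)[::-1]
    kept_idx, pruned_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    weights = np.ones(keep)                 # tokens represented by each kept slot

    for i in pruned_idx:
        # Cosine similarity of the pruned token to every kept token.
        sims = kept @ tokens[i] / (
            np.linalg.norm(kept, axis=1) * np.linalg.norm(tokens[i]) + 1e-8
        )
        j = int(np.argmax(sims))
        kept[j] = (kept[j] * weights[j] + tokens[i]) / (weights[j] + 1)
        weights[j] += 1
    return kept

tokens = np.random.randn(197, 64)   # e.g. ViT: 196 patches + CLS, dim 64
scores = np.random.rand(197)        # stand-in for attention-based importance
print(prune_and_merge(tokens, scores, keep=98).shape)  # (98, 64)
```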
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
- AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
- EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation [84.70637613266835]
EoRA is a fine-tuning-free method that augments compressed Large Language Models with low-rank matrices. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs.
arXiv Detail & Related papers (2024-10-28T17:59:03Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
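A minimal sketch of the core operation, a truncated-SVD factorization of a K/V projection so the cache can store low-rank latents, assuming plain numpy and an arbitrary fixed rank (the progressive per-layer rank schedule from the title is not reproduced here):

```python
import numpy as np

def low_rank_factor(W: np.ndarray, rank: int):
    """Factor a (d_model, d_head) projection W ~= A @ B with A:(d_model, r)
    and B:(r, d_head) via truncated SVD -- a one-time decomposition that
    plugs in without retraining."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]           # (d_model, r)
    B = Vt[:rank]                        # (r, d_head)
    return A, B

d_model, d_head, r = 1024, 128, 32
W_k = np.random.randn(d_model, d_head) / np.sqrt(d_model)
A, B = low_rank_factor(W_k, r)

x = np.random.randn(16, d_model)         # 16 token hidden states
latent = x @ A                           # cache this: (16, r) instead of (16, d_head)
k = latent @ B                           # reconstruct keys on demand
print(np.abs(k - x @ W_k).max())         # approximation error of the rank-32 factor
```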
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting [12.006890185810322]
We introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices.
Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique to reduce the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme to reduce the memory overhead by reducing the backpropagation depth, and (3) a complementary hardware scheduling strategy to handle the irregular computation patterns introduced by LUC and adaptive layer tuning.
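As a toy illustration of what a layer-wise policy might output (the actual LUC procedure is only named in the abstract), the sketch below maps an assumed per-layer sensitivity score to a (pruning sparsity, bit-width) pair; all scores and thresholds are invented.

```python
# Toy layer-wise unified-compression policy in the spirit of LUC:
# sensitive layers get lower pruning sparsity and more quantization bits.
def luc_policy(sensitivity: dict[str, float]) -> dict[str, tuple[float, int]]:
    policy = {}
    for layer, s in sensitivity.items():
        if s > 0.8:
            policy[layer] = (0.1, 8)   # (pruning sparsity, bit-width)
        elif s > 0.5:
            policy[layer] = (0.3, 6)
        else:
            policy[layer] = (0.5, 4)
    return policy

# e.g. sensitivity measured as loss increase when the layer alone is perturbed
sensitivity = {"embed": 0.9, "block.0": 0.6, "block.1": 0.4, "lm_head": 0.85}
for layer, (sparsity, bits) in luc_policy(sensitivity).items():
    print(f"{layer}: prune {sparsity:.0%}, quantize to {bits}-bit")
```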
arXiv Detail & Related papers (2024-06-22T06:51:47Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
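This one is concrete enough to sketch end to end: sample binary masks from per-unit Bernoulli distributions and update the logits with a REINFORCE estimator, so no gradient ever flows through the model itself. The toy loss below is a stand-in for a forward pass of the pruned LLM; its constants are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                   # prunable units (e.g. heads, channels)
importance = rng.random(n)               # toy stand-in for each unit's usefulness
theta = np.zeros(n)                      # Bernoulli logits; keep-prob = sigmoid

def loss(mask: np.ndarray) -> float:
    """Stand-in for the pruned model's loss: keeping a unit incurs a fixed
    compute cost of 0.5, pruning it forfeits its importance. In the real
    method this is a forward pass of the masked LLM -- no backprop needed."""
    return float((1 - mask) @ importance + 0.5 * mask.sum())

lr, baseline = 0.5, 0.0
for step in range(2000):
    p = 1 / (1 + np.exp(-theta))
    mask = (rng.random(n) < p).astype(float)        # sample a binary mask
    L = loss(mask)
    baseline = 0.9 * baseline + 0.1 * L             # variance-reducing baseline
    # REINFORCE: grad_theta E[loss] is estimated by (L - b) * (mask - p)
    theta -= lr * (L - baseline) * (mask - p)

p = 1 / (1 + np.exp(-theta))
print("keep probability:", np.round(p, 2))
print("unit importance: ", np.round(importance, 2))  # kept units ~ importance > 0.5
```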
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- AdpQ: A Zero-shot Calibration Free Adaptive Post Training Quantization Method for LLMs [22.25748046511075]
AdpQ is a novel zero-shot adaptive PTQ method for Large Language Models (LLMs). It achieves state-of-the-art performance in low-precision quantization without requiring any calibration data. Our results match the accuracy of existing methods on various LLM benchmarks while reducing quantization time by at least 10x.
arXiv Detail & Related papers (2024-05-22T05:32:11Z)
- Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [37.492637804756164]
We introduce a versatile adaptation approach that can effectively work under all three settings.
We propose the dual memory networks that comprise dynamic and static memory components.
Our approach is tested across 11 datasets under the three task settings.
arXiv Detail & Related papers (2024-03-26T10:54:07Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
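The abstract leaves the AQ mechanics implicit; the simplest member of the additive-quantization family is greedy residual coding, shown below with small random codebooks. AQLM itself learns its codebooks and uses beam search rather than this greedy rule.

```python
import numpy as np

rng = np.random.default_rng(1)
G, M, K = 8, 2, 256     # group size, number of codebooks, codes per codebook
codebooks = rng.standard_normal((M, K, G)) * 0.1   # fixed random stand-ins

def aq_encode(w: np.ndarray) -> np.ndarray:
    """Greedy residual encoding: per codebook, pick the vector nearest the
    remaining residual; the 8-weight group becomes M uint8 indices."""
    codes, residual = [], w.copy()
    for m in range(M):
        j = int(((codebooks[m] - residual) ** 2).sum(axis=1).argmin())
        codes.append(j)
        residual = residual - codebooks[m, j]
    return np.array(codes, dtype=np.uint8)

def aq_decode(codes: np.ndarray) -> np.ndarray:
    return sum(codebooks[m, j] for m, j in enumerate(codes))

w = rng.standard_normal(G) * 0.1                    # one group of 8 weights
codes = aq_encode(w)
print("codes:", codes, "| bits/weight:", M * 8 / G) # 2 bits per weight
print("max reconstruction error:", float(np.abs(aq_decode(codes) - w).max()))
```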
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
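The MeZO recipe is simple enough to render in a few lines: a two-point SPSA gradient estimate whose Gaussian perturbation is replayed from a stored seed, so the optimizer never holds an extra parameter-sized buffer. A minimal numpy sketch on a toy quadratic loss:

```python
import numpy as np

def z_from_seed(seed, shape):
    """Regenerate the Gaussian perturbation from its seed -- the memory
    trick: store one integer per step, not a parameter-sized tensor."""
    return np.random.default_rng(seed).standard_normal(shape)

def mezo_step(theta, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    z = z_from_seed(seed, theta.shape)
    theta += eps * z                        # perturb in place
    l_plus = loss_fn(theta)
    theta -= 2 * eps * z                    # now at theta - eps * z
    l_minus = loss_fn(theta)
    theta += eps * z                        # restore theta exactly
    g = (l_plus - l_minus) / (2 * eps)      # scalar projected gradient
    theta -= lr * g * z_from_seed(seed, theta.shape)  # replayed, not stored
    return theta

# Toy problem: drive theta toward a fixed target using only forward passes.
target = np.arange(4.0)
loss_fn = lambda t: float(((t - target) ** 2).sum())
theta = np.zeros(4)
for step in range(3000):
    theta = mezo_step(theta, loss_fn, seed=step)
print(np.round(theta, 2))                   # approaches [0. 1. 2. 3.]
```

Two forward passes per step replace backpropagation entirely, which is where the inference-level memory footprint comes from.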
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [24.108008515395458]
We propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency.
For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results and respectively outperform the second-best by +1.59% and +1.99% under 16 shots with 30x fewer learnable parameters.
arXiv Detail & Related papers (2023-04-03T17:58:54Z)