Alchemist: Towards the Design of Efficient Online Continual Learning System
- URL: http://arxiv.org/abs/2503.01066v2
- Date: Fri, 14 Mar 2025 16:57:12 GMT
- Title: Alchemist: Towards the Design of Efficient Online Continual Learning System
- Authors: Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang,
- Abstract summary: We propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist significantly increases training throughput by up to 1.72x, reduces memory usage during training by up to 47%, and supports up to 2x more training tokens.
- Score: 15.224901317189728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continual learning has become a promising solution to refine large language models incrementally by leveraging user feedback. In particular, online continual learning - iteratively training the model with small batches of user feedback - has demonstrated notable performance improvements. However, the existing practice of separating training and serving processes forces the online trainer to recompute the intermediate results already done during serving. Such redundant computations can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge, the first online continual learning system that efficiently reuses serving activations to increase training throughput. Alchemist introduces two key techniques: (1) recording and storing activations and KV cache only during the prefill phase to minimize latency and memory overhead; and (2) smart activation offloading and hedging. Evaluations with inputs of varied token length sampled from ShareGPT dataset show that compared with a separate training cluster, Alchemist significantly increases training throughput by up to 1.72x, reduces up to 47% memory usage during training, and supports up to 2x more training tokens - all while maintaining negligible impact on serving latency.
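As a rough illustration of the core idea (hypothetical class and method names, not the authors' implementation), the sketch below records prefill-phase activations on the serving side, offloads them to host memory when a GPU budget is exceeded, and hands them back to the online trainer so the serving-time forward work is not repeated:

```python
import torch

class PrefillActivationStore:
    """Hypothetical sketch: keep prefill-phase activations produced while
    serving so an online trainer can reuse them instead of recomputing them."""

    def __init__(self, gpu_budget_bytes: int):
        self.gpu_budget = gpu_budget_bytes
        self.gpu_used = 0
        self.cache = {}  # request_id -> list of (layer_idx, tensor)

    def record(self, request_id: str, layer_idx: int, act: torch.Tensor):
        # Keep the tensor on the GPU while the budget allows; otherwise
        # offload it to pinned host memory (a much-simplified stand-in for
        # the paper's smart offloading and hedging policy).
        nbytes = act.element_size() * act.nelement()
        if self.gpu_used + nbytes <= self.gpu_budget:
            stored, self.gpu_used = act.detach(), self.gpu_used + nbytes
        else:
            stored = torch.empty(act.shape, dtype=act.dtype,
                                 device="cpu", pin_memory=True)
            stored.copy_(act.detach(), non_blocking=True)
        self.cache.setdefault(request_id, []).append((layer_idx, stored))

    def fetch(self, request_id: str, device: str = "cuda"):
        # Called by the trainer: bring any offloaded activations back so the
        # backward pass can start without redoing the serving-time forward.
        return [(i, t.to(device, non_blocking=True))
                for i, t in self.cache.pop(request_id, [])]
```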
Related papers
- HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We propose a novel learning-based caching framework dubbed HarmoniCa. It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process. It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
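The general mechanism behind such feature caching (shown here as a generic sketch, not HarmoniCa's code) is to reuse a block's output from the previous denoising step whenever a learned or heuristic decision says recomputation can be skipped:

```python
def denoise_with_feature_cache(blocks, x, timesteps, reuse):
    """Generic sketch of step-wise feature caching in a diffusion transformer:
    reuse[t][i] == True means block i reuses its cached output at step t."""
    cache = [None] * len(blocks)
    for t in timesteps:
        h = x
        for i, block in enumerate(blocks):
            if reuse[t][i] and cache[i] is not None:
                h = cache[i]          # skip the block, reuse last step's feature
            else:
                h = block(h)          # recompute and refresh the cache
                cache[i] = h
        x = h                         # scheduler/noise update omitted for brevity
    return x
```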
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
- Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena [126.70522244144088]
We introduce Arena Learning, an innovative offline strategy designed to simulate arena battles using AI-driven annotations.
Arena Learning ensures precise evaluations and maintains consistency between offline simulations and online competitions.
We apply Arena Learning to train our target model, WizardLM-β, and demonstrate significant performance enhancements.
arXiv Detail & Related papers (2024-07-15T11:26:07Z)
- ProTrain: Efficient LLM Training via Memory-Aware Techniques [18.30799115938978]
This paper proposes ProTrain, a novel training system that balances memory usage and performance by coordinating memory, computation, and IO.
ProTrain improves training throughput by 1.43x to 2.71x compared to state-of-the-art training systems.
arXiv Detail & Related papers (2024-06-12T15:40:06Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves as training progresses.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
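A hedged sketch of that initialization idea (the `forward_features` call and the simple mean-pooling are assumptions, not the paper's exact recipe): patch tokens from the downstream data are pooled into prototypes, which become the initial prompt parameters:

```python
import torch

@torch.no_grad()
def init_prompts_from_prototypes(backbone, loader, num_prompts, device="cuda"):
    """Sketch: build prompt initializations from downstream patch-token prototypes."""
    feats = []
    for images, _ in loader:
        tokens = backbone.forward_features(images.to(device))  # (B, N, D), assumed API
        feats.append(tokens.mean(dim=1))                        # per-image patch-token mean
    feats = torch.cat(feats)                                    # (num_images, D)
    # Crude prototype extraction: average over equal chunks (k-means would also be natural).
    prototypes = torch.stack([c.mean(dim=0) for c in feats.chunk(num_prompts)])
    return torch.nn.Parameter(prototypes)                       # (num_prompts, D)
```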
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
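A compact sketch of the dampening step (the thresholds and the exact rule are paraphrased, so treat the constants as placeholders): parameters whose diagonal Fisher importance on the forget set far exceeds their importance on the full data are scaled toward zero:

```python
import torch

@torch.no_grad()
def selective_synaptic_dampening(model, imp_full, imp_forget, alpha=10.0, lam=1.0):
    """Sketch of SSD-style dampening. imp_full / imp_forget map parameter
    names to diagonal Fisher estimates (mean squared gradients) computed on
    the full dataset and the forget set, respectively."""
    for name, p in model.named_parameters():
        i_full, i_forget = imp_full[name], imp_forget[name]
        # Select weights disproportionately important to the data being forgotten...
        select = i_forget > alpha * i_full
        # ...and dampen them, more strongly the larger the importance gap.
        factor = torch.clamp(lam * i_full / i_forget.clamp_min(1e-12), max=1.0)
        p[select] *= factor[select]
```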
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
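The sketch below shows a generic cross-batch embedding memory with a crude drift correction (re-standardizing stored embeddings against current-batch statistics); it illustrates the problem being addressed rather than the paper's actual adaptive mechanism:

```python
import torch

class CrossBatchMemory:
    """Illustrative FIFO memory of past embeddings for metric learning."""

    def __init__(self, size: int, dim: int):
        self.bank = torch.zeros(size, dim)
        self.ptr, self.count = 0, 0

    def enqueue(self, emb: torch.Tensor):
        n = emb.size(0)
        idx = (self.ptr + torch.arange(n)) % self.bank.size(0)
        self.bank[idx] = emb.detach().cpu()
        self.ptr = (self.ptr + n) % self.bank.size(0)
        self.count = min(self.count + n, self.bank.size(0))

    def get(self, current_emb: torch.Tensor) -> torch.Tensor:
        cur = current_emb.detach().cpu()
        stored = self.bank[: self.count]
        # Naive drift compensation: match the stored embeddings' first and
        # second moments to the current batch before using them as negatives.
        aligned = (stored - stored.mean(0)) / (stored.std(0) + 1e-6)
        return aligned * cur.std(0) + cur.mean(0)
```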
arXiv Detail & Related papers (2023-03-30T03:22:52Z)
- Task-oriented Memory-efficient Pruning-Adapter [3.0751447761822903]
We propose a task-oriented Pruning-Adapter method that achieves high efficiency in both training and memory usage.
It shows no significant decrease in accuracy on GLUE tasks, achieving training and inference efficiency at the same time.
arXiv Detail & Related papers (2023-03-26T12:18:00Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores [4.774170751209782]
We show how ML training benefits from storage pushdowns by focusing on transfer learning (TL).
We propose HAPI, a new TL processing system centered around two complementary techniques that address challenges introduced by disaggregation.
arXiv Detail & Related papers (2022-10-16T22:28:36Z)
- DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search of quantization bit-width with the training, which makes the procedure complicated and less transparent.
We propose a simple and effective method to compress DNN training.
Experimental results show that DIVISION has better comprehensive performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
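The linear-algebra fact that makes such squeezing possible is that parallel linear branches can be folded into a single kernel; a minimal sketch (RepVGG-style 3x3 + 1x1 + identity fusion, not OREPA's full block) is:

```python
import torch
import torch.nn.functional as F

def merge_branches(w3x3: torch.Tensor, w1x1: torch.Tensor) -> torch.Tensor:
    """Fold parallel 3x3, 1x1 and identity branches into one 3x3 kernel
    (valid when in_channels == out_channels and groups == 1)."""
    channels = w3x3.shape[0]
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])          # place the 1x1 kernel at the center
    w_id = torch.zeros_like(w3x3)
    w_id[torch.arange(channels), torch.arange(channels), 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id

# Training applies only the merged kernel, so a single convolution's
# activations (instead of one per branch) are kept for the backward pass.
x = torch.randn(1, 8, 32, 32)
w3 = torch.randn(8, 8, 3, 3, requires_grad=True)
w1 = torch.randn(8, 8, 1, 1, requires_grad=True)
y = F.conv2d(x, merge_branches(w3, w1), padding=1)
```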
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking [58.14267480293575]
We propose a simple yet effective online learning approach for few-shot online adaptation without requiring offline training.
It allows an in-built memory retention mechanism for the model to remember the knowledge about the object seen before.
We evaluate our approach based on two networks in the online learning families for tracking, i.e., multi-layer perceptrons in RT-MDNet and convolutional neural networks in DiMP.
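The estimator at the core of such approaches is the textbook recursive least-squares update; a self-contained sketch (the tracker's surrounding pipeline is omitted) is:

```python
import numpy as np

class RecursiveLeastSquares:
    """Standard RLS update for a linear predictor, usable for cheap online
    adaptation of a model's last layer without offline retraining."""

    def __init__(self, dim: int, lam: float = 0.99, delta: float = 100.0):
        self.w = np.zeros(dim)        # weights being adapted online
        self.P = delta * np.eye(dim)  # inverse correlation matrix estimate
        self.lam = lam                # forgetting factor (<1 discounts old frames)

    def update(self, x: np.ndarray, y: float) -> float:
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)             # gain vector
        err = y - self.w @ x                      # a-priori prediction error
        self.w += k * err
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return err
```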
arXiv Detail & Related papers (2021-12-28T06:51:18Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can roughly halve the memory footprint during training.
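For a single linear layer the idea looks roughly like the sketch below (half precision is used here for brevity; Mesa itself applies lower-precision quantization with extra bookkeeping):

```python
import torch

class MemorySavingLinear(torch.autograd.Function):
    """Sketch: compute the forward pass exactly, but keep only a
    half-precision copy of the input activation for the backward pass."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x.to(torch.float16), weight)  # low-precision activation
        return x @ weight.t()                                # exact forward computation

    @staticmethod
    def backward(ctx, grad_out):
        x_lp, weight = ctx.saved_tensors
        grad_x = grad_out @ weight
        grad_w = grad_out.t() @ x_lp.float()                 # dequantize before use
        return grad_x, grad_w

# Usage: roughly halves the activation memory retained for this layer.
x = torch.randn(4, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
MemorySavingLinear.apply(x, w).sum().backward()
```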
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
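A minimal sketch of such a schedule (the constants are placeholders, not the paper's settings): sequences start short and grow linearly to the full length over a warmup window:

```python
def curriculum_seq_len(step: int, warmup_steps: int,
                       start_len: int = 64, full_len: int = 2048,
                       align: int = 8) -> int:
    """Sequence-length curriculum: short sequences early, full length later."""
    if step >= warmup_steps:
        return full_len
    cur = start_len + (full_len - start_len) * step / warmup_steps
    return max(start_len, int(cur) // align * align)   # keep lengths aligned

# In the training loop each batch is simply truncated to the scheduled length:
#   seq_len = curriculum_seq_len(step, warmup_steps)
#   input_ids = input_ids[:, :seq_len]
```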
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
- Dynamic Sparse Training for Deep Reinforcement Learning [36.66889208433228]
We propose for the first time to dynamically train deep reinforcement learning agents with sparse neural networks from scratch.
Our approach is easy to integrate into existing deep reinforcement learning algorithms.
We evaluate our approach on OpenAI gym continuous control tasks.
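The core of dynamic sparse training is a periodic prune-and-grow step over a weight mask; a SET-style sketch (the RL-specific details are omitted) looks like:

```python
import torch

@torch.no_grad()
def prune_and_grow(weight: torch.Tensor, mask: torch.Tensor, drop_frac: float = 0.1):
    """SET-style sparse topology update: drop the smallest-magnitude active
    weights and regrow the same number of connections at random positions,
    so the overall sparsity level stays fixed."""
    active = mask.nonzero(as_tuple=True)
    n_drop = int(drop_frac * active[0].numel())
    if n_drop == 0:
        return mask
    drop = weight[active].abs().topk(n_drop, largest=False).indices
    mask[tuple(a[drop] for a in active)] = 0
    inactive = (mask == 0).nonzero(as_tuple=True)
    grow = torch.randperm(inactive[0].numel())[:n_drop]
    mask[tuple(a[grow] for a in inactive)] = 1
    weight[tuple(a[grow] for a in inactive)] = 0.0   # regrown weights start at zero
    return mask
```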
arXiv Detail & Related papers (2021-06-08T09:57:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.