Memory Analysis on the Training Course of DeepSeek Models
- URL: http://arxiv.org/abs/2502.07846v1
- Date: Tue, 11 Feb 2025 09:51:25 GMT
- Title: Memory Analysis on the Training Course of DeepSeek Models
- Authors: Ping Zhang, Lei Su
- Abstract summary: We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3.
It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations.
- Score: 5.482535254884105
- Abstract: We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in the training of large-scale mixture-of-experts models.
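To make that accounting concrete, here is a minimal sketch of per-GPU model-state memory under mixed-precision Adam, 3D parallelism, and the three ZeRO stages. The 2+2+12 bytes-per-parameter split follows the standard ZeRO accounting; the function and its arguments are illustrative assumptions, not DeepSeek's actual configuration, and activation memory (governed by micro-batch size and recomputation policy) is deliberately excluded.

```python
def model_state_bytes_per_gpu(
    n_params: float,      # parameters held by this pipeline stage
    tp: int = 1,          # tensor-parallel degree
    dp: int = 1,          # data-parallel (ZeRO) group size
    zero_stage: int = 0,  # 0, 1, 2, or 3
) -> float:
    """Per-GPU bytes for weights, gradients, and Adam states (mixed precision)."""
    shard = n_params / tp
    weights = 2 * shard   # fp16/bf16 weights
    grads = 2 * shard     # fp16/bf16 gradients
    optim = 12 * shard    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:   # ZeRO-1 shards optimizer states across dp ranks
        optim /= dp
    if zero_stage >= 2:   # ZeRO-2 additionally shards gradients
        grads /= dp
    if zero_stage >= 3:   # ZeRO-3 additionally shards the weights themselves
        weights /= dp
    return weights + grads + optim

# e.g. a hypothetical 10B-parameter pipeline stage with tp=8, dp=16, ZeRO-1:
print(model_state_bytes_per_gpu(10e9, tp=8, dp=16, zero_stage=1) / 2**30, "GiB")
```

Raising the ZeRO stage lowers per-GPU memory at the cost of extra communication, which is exactly the kind of configuration-dependent trade-off such an analysis quantifies.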
Related papers
- DeepSeek-V3 Technical Report [147.16121855209246]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
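For intuition about the headline numbers, a small arithmetic sketch of why an MoE model's activated parameter count sits far below its total; the sizes below are placeholders chosen only so the ratio resembles 671B/37B, not DeepSeek-V3's actual expert layout.

```python
def moe_param_counts(shared: float, n_experts: int, top_k: int, expert: float):
    """Total vs. per-token activated parameters for a routed MoE stack."""
    total = shared + n_experts * expert
    activated = shared + top_k * expert
    return total, activated

# Placeholder sizes (illustrative only): 17B shared, 256 experts of 2.55B, top-8 routing.
total, activated = moe_param_counts(17e9, n_experts=256, top_k=8, expert=2.55e9)
print(f"{total/1e9:.0f}B total, {activated/1e9:.0f}B activated per token")
```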
arXiv Detail & Related papers (2024-12-27T04:03:16Z)
- Three Things to Know about Deep Metric Learning [34.16300515811057]
This paper addresses supervised deep metric learning for open-set image retrieval.
It focuses on three key aspects: the loss function, mixup regularization, and model initialization.
Through a systematic study of these components, we demonstrate that their synergy enables large models to nearly solve popular benchmarks.
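Of the three components, mixup is the most mechanical to illustrate; a generic sketch is below (the paper's exact variant, e.g. whether it mixes raw inputs or embeddings, may differ).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.2):
    """Blend two examples and their one-hot labels with a Beta-sampled weight."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```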
arXiv Detail & Related papers (2024-12-17T00:49:12Z)
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
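The reversibility trick can be sketched independently of the paper's two specific designs: if a block's inputs can be recomputed exactly from its outputs, the backward pass never needs stored activations. The coupling below is the generic RevNet-style form, used here as an assumed illustration.

```python
def rev_forward(x1, x2, f, g):
    """Forward through one reversible block; f and g are arbitrary sub-networks."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Reconstruct the inputs exactly from the outputs during backprop."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```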
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- DepthSplat: Connecting Gaussian Splatting and Depth [90.06180236292866]
We present DepthSplat to connect Gaussian splatting and depth estimation.
We first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features.
We also show that Gaussian splatting can serve as an unsupervised pre-training objective.
arXiv Detail & Related papers (2024-10-17T17:59:58Z)
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and demand long-term memory.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
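As the name suggests, the model is built around elementwise (Hadamard) products; the sketch below conveys only that general shape of update, with the calibration and write terms left abstract, and should not be read as the paper's actual parameterization.

```python
import numpy as np

def hadamard_memory_step(memory, calibration, update):
    """Rescale the kept memory elementwise, then write new content."""
    return memory * calibration + update

# Toy usage with same-shape matrices:
rng = np.random.default_rng(0)
m = hadamard_memory_step(rng.standard_normal((4, 8)),
                         rng.uniform(0.0, 1.0, size=(4, 8)),
                         rng.standard_normal((4, 8)))
```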
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- Analysis of the Memorization and Generalization Capabilities of AI Agents: Are Continual Learners Robust? [91.682459306359]
In continual learning (CL), an AI agent learns from non-stationary data streams under dynamic environments.
In this paper, a novel CL framework is proposed to achieve robust generalization to dynamic environments while retaining past knowledge.
The generalization and memorization performance of the proposed framework are theoretically analyzed.
arXiv Detail & Related papers (2023-09-18T21:00:01Z)
- MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation [50.86932607832793]
We propose MAMo, a novel memory and attention frame-work for monocular video depth estimation.
In MAMo, we augment the model with a memory that aids depth prediction as the model streams through the video.
We show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy.
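The memory-plus-attention idea can be sketched generically: features from past frames are stored, and the current frame attends over them. The cross-attention below is a plain scaled dot-product stand-in, not MAMo's actual module.

```python
import numpy as np

def attend_to_memory(query, mem_keys, mem_values):
    """Current-frame features (n_q, d) attend over stored past-frame features."""
    scores = query @ mem_keys.T / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over memory slots
    return w @ mem_values                # (n_q, d) memory readout
```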
arXiv Detail & Related papers (2023-07-26T17:55:32Z)
- Analysis of memory consumption by neural networks based on hyperparameters [0.0]
We propose a generic analysis of memory consumption while training deep learning models.
Changes in hyperparameters and the number of hidden layers are the variables considered in this approach.
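A minimal version of such an analysis for a fully connected network, where parameter memory follows directly from the layer-width hyperparameters; the function and its fp32 assumption are illustrative, not the paper's exact model.

```python
def dense_param_bytes(layer_sizes, bytes_per_value=4):
    """Parameter memory (weights + biases) of a fully connected net, fp32 by default."""
    return bytes_per_value * sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

# Doubling the hidden widths roughly quadruples the hidden-to-hidden weight memory:
print(dense_param_bytes([784, 512, 512, 10]))
print(dense_param_bytes([784, 1024, 1024, 10]))
```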
arXiv Detail & Related papers (2021-10-21T18:49:44Z)
- More Is Better: An Analysis of Instance Quantity/Quality Trade-off in Rehearsal-based Continual Learning [3.9596068699962315]
The key issue of Continual Learning has become that of addressing the stability-plasticity dilemma of connectionist systems.
We propose an analysis of the memory quantity/quality trade-off adopting various data reduction approaches to increase the number of instances storable in memory.
Our findings suggest that the optimal trade-off is severely skewed toward instance quantity, where rehearsal approaches with several heavily compressed instances easily outperform state-of-the-art approaches.
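The quantity side of that trade-off is simple arithmetic: compressing each instance by a factor k lets k times as many fit in the same buffer. A sketch, with the budget and image size as assumed examples:

```python
def storable_instances(budget_bytes: int, instance_bytes: int, compression: float = 1.0) -> int:
    """How many rehearsal instances fit in a fixed buffer at a given compression factor."""
    return int(budget_bytes // (instance_bytes / compression))

buf = 100 * 2**20   # assumed 100 MiB rehearsal budget
img = 32 * 32 * 3   # one raw CIFAR-style uint8 image
print(storable_instances(buf, img), storable_instances(buf, img, compression=8.0))
```

The quality cost of that compression is what the paper's data-reduction comparisons measure.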
arXiv Detail & Related papers (2021-05-28T21:05:51Z)
- Memory-based Deep Reinforcement Learning for POMDP [7.137228786549488]
We propose a Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3) algorithm.
Our results demonstrate the significant advantages of the memory component in addressing Partially Observable MDPs.
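A sketch of the recurrent actor at the heart of such an agent: the LSTM state summarizes the observation history, which is what standard TD3 lacks under partial observability. Layer sizes here are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """LSTM-based actor head in the spirit of LSTM-TD3 (illustrative sizes)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)   # out: (batch, time, hidden)
        return self.head(out[:, -1]), state      # action from the last step
```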
arXiv Detail & Related papers (2021-02-24T15:25:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.