Memory Analysis on the Training Course of DeepSeek Models
- URL: http://arxiv.org/abs/2502.07846v1
- Date: Tue, 11 Feb 2025 09:51:25 GMT
- Title: Memory Analysis on the Training Course of DeepSeek Models
- Authors: Ping Zhang, Lei Su
- Abstract summary: We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3.
It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations.
- Score: 5.482535254884105
- Abstract: We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in the training of large-scale mixture-of-experts models.
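To make that accounting concrete, here is a minimal sketch of per-GPU model-state memory under mixed-precision Adam, 3D parallelism, and the three ZeRO stages. The 2+2+12 bytes-per-parameter split follows the standard ZeRO accounting; the function and its arguments are illustrative assumptions, not DeepSeek's actual configuration, and activation memory (governed by micro-batch size and recomputation policy) is deliberately excluded.

```python
def model_state_bytes_per_gpu(
    n_params: float,      # parameters held by this pipeline stage
    tp: int = 1,          # tensor-parallel degree
    dp: int = 1,          # data-parallel (ZeRO) group size
    zero_stage: int = 0,  # 0, 1, 2, or 3
) -> float:
    """Per-GPU bytes for weights, gradients, and Adam states (mixed precision)."""
    shard = n_params / tp
    weights = 2 * shard   # fp16/bf16 weights
    grads = 2 * shard     # fp16/bf16 gradients
    optim = 12 * shard    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:   # ZeRO-1 shards optimizer states across dp ranks
        optim /= dp
    if zero_stage >= 2:   # ZeRO-2 additionally shards gradients
        grads /= dp
    if zero_stage >= 3:   # ZeRO-3 additionally shards the weights themselves
        weights /= dp
    return weights + grads + optim

# e.g. a hypothetical 10B-parameter pipeline stage with tp=8, dp=16, ZeRO-1:
print(model_state_bytes_per_gpu(10e9, tp=8, dp=16, zero_stage=1) / 2**30, "GiB")
```

Raising the ZeRO stage lowers per-GPU memory at the cost of extra communication, which is exactly the kind of configuration-dependent trade-off such an analysis quantifies.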
Related papers
- DeepSeek-V3 Technical Report [147.16121855209246]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages.
Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
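For intuition about the headline numbers, a small arithmetic sketch of why an MoE model's activated parameter count sits far below its total; the sizes below are placeholders chosen only so the ratio resembles 671B/37B, not DeepSeek-V3's actual expert layout.

```python
def moe_param_counts(shared: float, n_experts: int, top_k: int, expert: float):
    """Total vs. per-token activated parameters for a routed MoE stack."""
    total = shared + n_experts * expert
    activated = shared + top_k * expert
    return total, activated

# Placeholder sizes (illustrative only): 17B shared, 256 experts of 2.55B, top-8 routing.
total, activated = moe_param_counts(17e9, n_experts=256, top_k=8, expert=2.55e9)
print(f"{total/1e9:.0f}B total, {activated/1e9:.0f}B activated per token")
```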
arXiv Detail & Related papers (2024-12-27T04:03:16Z)
- Three Things to Know about Deep Metric Learning [34.16300515811057]
This paper addresses supervised deep metric learning for open-set image retrieval.
It focuses on three key aspects: the loss function, mixup regularization, and model initialization.
Through a systematic study of these components, we demonstrate that their synergy enables large models to nearly solve popular benchmarks.
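Of the three components, mixup is the most mechanical to illustrate; a generic sketch is below (the paper's exact variant, e.g. whether it mixes raw inputs or embeddings, may differ).

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha: float = 0.2):
    """Blend two examples and their one-hot labels with a Beta-sampled weight."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```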
arXiv Detail & Related papers (2024-12-17T00:49:12Z)
- Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification [50.596077598766975]
We explore a memory-efficient training strategy for deep speaker embedding learning in resource-constrained scenarios.
For activations, we design two types of reversible neural networks which eliminate the need to store intermediate activations.
For states, we introduce a dynamic quantization approach that replaces the original 32-bit floating-point values with a dynamic tree-based 8-bit data type.
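The reversibility trick can be sketched independently of the paper's two specific designs: if a block's inputs can be recomputed exactly from its outputs, the backward pass never needs stored activations. The coupling below is the generic RevNet-style form, used here as an assumed illustration.

```python
def rev_forward(x1, x2, f, g):
    """Forward through one reversible block; f and g are arbitrary sub-networks."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Reconstruct the inputs exactly from the outputs during backprop."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```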
arXiv Detail & Related papers (2024-12-02T06:57:46Z)
- DepthSplat: Connecting Gaussian Splatting and Depth [90.06180236292866]
We present DepthSplat to connect Gaussian splatting and depth estimation.
We first contribute a robust multi-view depth model by leveraging pre-trained monocular depth features.
We also show that Gaussian splatting can serve as an unsupervised pre-training objective.
arXiv Detail & Related papers (2024-10-17T17:59:58Z)
- Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning [64.93848182403116]
Current deep-learning memory models struggle in reinforcement learning environments that are partially observable and demand long-term memory.
We introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents.
Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks.
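As the name suggests, the model is built around elementwise (Hadamard) products; the sketch below conveys only that general shape of update, with the calibration and write terms left abstract, and should not be read as the paper's actual parameterization.

```python
import numpy as np

def hadamard_memory_step(memory, calibration, update):
    """Rescale the kept memory elementwise, then write new content."""
    return memory * calibration + update

# Toy usage with same-shape matrices:
rng = np.random.default_rng(0)
m = hadamard_memory_step(rng.standard_normal((4, 8)),
                         rng.uniform(0.0, 1.0, size=(4, 8)),
                         rng.standard_normal((4, 8)))
```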
arXiv Detail & Related papers (2024-10-14T03:50:17Z)
- Analysis of the Memorization and Generalization Capabilities of AI Agents: Are Continual Learners Robust? [91.682459306359]
In continual learning (CL), an AI agent learns from non-stationary data streams under dynamic environments.
In this paper, a novel CL framework is proposed to achieve robust generalization to dynamic environments while retaining past knowledge.
The generalization and memorization performance of the proposed framework are theoretically analyzed.
arXiv Detail & Related papers (2023-09-18T21:00:01Z)
- MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation [50.86932607832793]
We propose MAMo, a novel memory and attention frame-work for monocular video depth estimation.
In MAMo, we augment the model with a memory that aids depth prediction as the model streams through the video.
We show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy.
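The memory-plus-attention idea can be sketched generically: features from past frames are stored, and the current frame attends over them. The cross-attention below is a plain scaled dot-product stand-in, not MAMo's actual module.

```python
import numpy as np

def attend_to_memory(query, mem_keys, mem_values):
    """Current-frame features (n_q, d) attend over stored past-frame features."""
    scores = query @ mem_keys.T / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over memory slots
    return w @ mem_values                # (n_q, d) memory readout
```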
arXiv Detail & Related papers (2023-07-26T17:55:32Z)
- Analysis of memory consumption by neural networks based on hyperparameters [0.0]
We propose a generic analysis of memory consumption while training deep learning models.
Changes in hyperparameters and the number of hidden layers are the variables considered in this approach.
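A minimal version of such an analysis for a fully connected network, where parameter memory follows directly from the layer-width hyperparameters; the function and its fp32 assumption are illustrative, not the paper's exact model.

```python
def dense_param_bytes(layer_sizes, bytes_per_value=4):
    """Parameter memory (weights + biases) of a fully connected net, fp32 by default."""
    return bytes_per_value * sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

# Doubling the hidden widths roughly quadruples the hidden-to-hidden weight memory:
print(dense_param_bytes([784, 512, 512, 10]))
print(dense_param_bytes([784, 1024, 1024, 10]))
```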
arXiv Detail & Related papers (2021-10-21T18:49:44Z)
- More Is Better: An Analysis of Instance Quantity/Quality Trade-off in Rehearsal-based Continual Learning [3.9596068699962315]
The key issue of Continual Learning has become that of addressing the stability-plasticity dilemma of connectionist systems.
We propose an analysis of the memory quantity/quality trade-off adopting various data reduction approaches to increase the number of instances storable in memory.
Our findings suggest that the optimal trade-off is severely skewed toward instance quantity, where rehearsal approaches with several heavily compressed instances easily outperform state-of-the-art approaches.
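The quantity side of that trade-off is simple arithmetic: compressing each instance by a factor k lets k times as many fit in the same buffer. A sketch, with the budget and image size as assumed examples:

```python
def storable_instances(budget_bytes: int, instance_bytes: int, compression: float = 1.0) -> int:
    """How many rehearsal instances fit in a fixed buffer at a given compression factor."""
    return int(budget_bytes // (instance_bytes / compression))

buf = 100 * 2**20   # assumed 100 MiB rehearsal budget
img = 32 * 32 * 3   # one raw CIFAR-style uint8 image
print(storable_instances(buf, img), storable_instances(buf, img, compression=8.0))
```

The quality cost of that compression is what the paper's data-reduction comparisons measure.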
arXiv Detail & Related papers (2021-05-28T21:05:51Z)
- Memory-based Deep Reinforcement Learning for POMDP [7.137228786549488]
We propose a Long-Short-Term-Memory-based Twin Delayed Deep Deterministic Policy Gradient (LSTM-TD3) algorithm.
Our results demonstrate the significant advantages of the memory component in addressing Partially Observable MDPs.
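A sketch of the recurrent actor at the heart of such an agent: the LSTM state summarizes the observation history, which is what standard TD3 lacks under partial observability. Layer sizes here are assumptions for illustration, not the paper's settings.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """LSTM-based actor head in the spirit of LSTM-TD3 (illustrative sizes)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)   # out: (batch, time, hidden)
        return self.head(out[:, -1]), state      # action from the last step
```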
arXiv Detail & Related papers (2021-02-24T15:25:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.