We've obtained and evaluated a pre-print of the DeepSeek-V3 technical report.
DeepSeek-V3: Core Contributions and Characteristics
This report details DeepSeek-V3, a Mixture-of-Experts (MoE) language model with a total of 671 billion parameters, where 37 billion are activated for each token. The model's design prioritizes efficient inference and cost-effective training, incorporating specific architectural components and training strategies.
Architectural Innovations:
- Multi-head Latent Attention (MLA): DeepSeek-V3 uses MLA to shrink the Key-Value (KV) cache at inference time through low-rank compression of attention keys and values: keys and values are jointly down-projected into a compact latent vector, and only that latent vector (plus a small decoupled positional component) needs to be cached. Queries are also low-rank compressed, mainly to reduce activation memory during training. This significantly reduces the inference memory footprint while maintaining performance comparable to standard multi-head attention (a minimal sketch of the idea follows this list).
- DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: The model employs the DeepSeekMoE architecture, which uses finer-grained experts and isolates some of them as shared experts. To counter the imbalanced expert load that MoE training tends to produce, it introduces an auxiliary-loss-free load-balancing strategy: instead of relying on conventional auxiliary losses, a dynamic bias term is added to each expert's affinity score when making routing decisions, nudging load toward under-used experts (a toy version of the update rule follows this list). A complementary sequence-wise auxiliary loss guards against extreme imbalance within a single sequence.
- Multi-Token Prediction (MTP): The model adds a multi-token prediction objective that extends the prediction scope to several future tokens at each position. The implementation uses small sequential modules to predict the additional tokens while keeping the complete causal chain at every prediction depth. At inference time the MTP modules can simply be dropped, so the model behaves as a standard next-token predictor, or retained to reduce latency via speculative decoding (a simplified sketch of the objective follows this list).
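To make the KV-cache saving concrete, here is a minimal sketch of the low-rank compression idea behind MLA, using toy dimensions. The decoupled RoPE path, query compression, and causal masking are omitted, so this illustrates the caching idea rather than the report's exact formulation.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy illustration of MLA-style low-rank KV compression.

    Only the compressed latent c_kv (d_latent floats per token) is cached,
    instead of full per-head keys and values. The real MLA also compresses
    queries and carries a decoupled RoPE component; both are omitted here.
    """
    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)          # down-projection (output is cached)
        self.w_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to keys
        self.w_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection to values
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c_kv = self.w_dkv(x)                                # (b, t, d_latent): this is all we cache
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_uk(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # causal mask omitted for brevity
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out), c_kv                          # return the updated latent cache
```

Per token, the cache holds d_latent values instead of 2 * n_heads * d_head, which is where the memory saving comes from.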
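The auxiliary-loss-free balancing idea fits in a few lines. The fixed-step bias update below follows the general description in the report (push the bias down for overloaded experts, up for under-loaded ones); the update speed, expert count, and random affinities are placeholder assumptions.

```python
import numpy as np

def route_with_bias(affinity, bias, top_k=8):
    """Pick top-k experts per token using bias-adjusted scores.

    The bias only influences *routing*; the gating weights that scale
    expert outputs still come from the original affinity scores.
    """
    adjusted = affinity + bias                       # (tokens, experts)
    return np.argsort(-adjusted, axis=1)[:, :top_k]  # indices of selected experts

def update_bias(bias, top_idx, n_experts, gamma=0.001):
    """Auxiliary-loss-free balancing: nudge biases against the observed load."""
    load = np.bincount(top_idx.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())  # lower bias for overloaded experts

# toy usage
n_tokens, n_experts = 4096, 256
affinity = np.random.rand(n_tokens, n_experts)
bias = np.zeros(n_experts)
for _ in range(10):                                   # a few "training steps"
    top_idx = route_with_bias(affinity, bias)
    bias = update_bias(bias, top_idx, n_experts)
```

Because the bias only changes which experts are selected, not the gating weights applied to their outputs, balancing does not require an auxiliary loss that interferes with the main training gradient.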
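A simplified sketch of the multi-token prediction objective: in addition to the usual next-token loss, the model predicts the token two steps ahead from a combination of its hidden state and the embedding of the intervening token. In the report this extra depth is handled by a small sequential Transformer module that shares the embedding and output head with the main model; the single linear projection and the loss weight below are stand-in assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, token_emb, targets, shared_head, mtp_proj, mtp_weight=0.3):
    """Toy depth-1 MTP objective.

    hidden:    (b, t, d) hidden states of the main model
    token_emb: (b, t, d) embeddings of the input tokens
    targets:   (b, t)    next-token ids (targets[:, i] is the token after position i)
    mtp_proj stands in for the report's sequential MTP module (a full
    Transformer block there); the output head is shared with the main model.
    """
    # standard next-token prediction loss
    logits = shared_head(hidden)                                      # (b, t, vocab)
    loss_main = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # depth-1 MTP: combine the hidden state at position i with the embedding
    # of token i+1, then predict the token at position i+2 (causal chain kept)
    combined = torch.cat([hidden[:, :-1], token_emb[:, 1:]], dim=-1)  # (b, t-1, 2d)
    mtp_logits = shared_head(mtp_proj(combined))                      # (b, t-1, vocab)
    loss_mtp = F.cross_entropy(mtp_logits.flatten(0, 1), targets[:, 1:].flatten())
    return loss_main + mtp_weight * loss_mtp

# hypothetical wiring:
# shared_head = nn.Linear(1024, vocab_size, bias=False)
# mtp_proj    = nn.Linear(2 * 1024, 1024, bias=False)
```

At inference the extra branch can be skipped entirely, or its one-step-ahead prediction can serve as a draft token for speculative decoding.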
Infrastructure and Training Framework:
- Compute Infrastructure: DeepSeek-V3 was trained on a cluster equipped with 2048 NVIDIA H800 GPUs. GPUs are connected by NVLink within nodes and by InfiniBand (IB) across nodes.
- DualPipe Algorithm: A pipeline-parallelism scheme named DualPipe overlaps computation and communication across the forward and backward passes. Each chunk's work is divided into components (attention, all-to-all dispatch, MLP, all-to-all combine), and these components are manually rearranged so that communication stays hidden behind computation during execution.
- Cross-Node All-to-All Communication: The authors implement custom kernels for cross-node all-to-all communication that exploit both IB and NVLink bandwidth. A node-limited routing mechanism caps the number of nodes each token can be dispatched to, and the communication kernels occupy only 20 SMs, leaving the remaining SMs free for computation.
- Memory-Saving Techniques: Several methods reduce memory usage, including recomputing RMSNorm outputs and MLA up-projections during back-propagation, keeping the exponential moving average (EMA) of model parameters in CPU memory, and sharing the embedding layer and output head between the main model and the MTP module.
- FP8 Training: A fine-grained mixed-precision framework based on the FP8 data format accelerates training and reduces GPU memory usage. To retain accuracy, the framework uses tile-wise (for activations) and block-wise (for weights) quantization to contain the effect of outliers, and periodically promotes partial GEMM accumulation results to FP32 on CUDA cores for higher-precision accumulation. Key components of the architecture are kept in BF16 or FP32 (a toy version of the block-wise scaling is sketched below).
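As an illustration of the fine-grained scaling idea, the sketch below gives each 128x128 block of a weight matrix its own scale before clamping to the FP8 (E4M3) dynamic range. The block size and the E4M3 maximum of 448 reflect common practice; the code only simulates scaling in float32 rather than casting to a real 8-bit format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def blockwise_quantize(w, block=128):
    """Simulate block-wise (128x128) FP8 scaling for a weight matrix.

    Each block gets its own scale, so a single outlier only degrades the
    precision of its block rather than the whole tensor. Real FP8 training
    also rounds values to an 8-bit format; here we only scale and clamp.
    """
    q = np.empty_like(w, dtype=np.float32)
    scales = {}
    rows, cols = w.shape
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i+block, j:j+block]
            scale = np.abs(tile).max() / FP8_E4M3_MAX + 1e-12
            q[i:i+block, j:j+block] = np.clip(tile / scale,
                                              -FP8_E4M3_MAX, FP8_E4M3_MAX)
            scales[(i, j)] = scale               # kept for dequantization
    return q, scales

w = np.random.randn(512, 512).astype(np.float32)
w[0, 0] = 300.0                                   # inject an outlier
q, scales = blockwise_quantize(w)
```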
Pre-Training and Post-Training:
- Data: DeepSeek-V3 was pre-trained on 14.8 trillion tokens with an updated tokenizer and a higher proportion of mathematical and programming samples. Pre-training also applies a Fill-in-Middle (FIM) objective, so the model learns to reconstruct a middle span conditioned on both the surrounding prefix and suffix without giving up ordinary next-token prediction (a toy example of the data format follows this list).
- Context Length Extension: Two additional training stages extend the context window, first to 32K and then to 128K tokens, using the YaRN method for RoPE extension.
- Post-Training: Post-training includes supervised fine-tuning (SFT) and reinforcement learning (RL), with reasoning capabilities distilled from the DeepSeek-R1 series of models. The RL stage combines rule-based and model-based rewards, and uses Group Relative Policy Optimization (GRPO), which estimates advantages from groups of sampled responses rather than a separate value model, to align the policy with human preferences (the advantage computation is sketched after this list).
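For reference, Fill-in-Middle training is usually implemented by reordering a document into a prefix-suffix-middle (PSM) layout with sentinel tokens, so the model still trains with ordinary left-to-right prediction. The sentinel strings and the 0.1 application rate below are illustrative placeholders rather than the report's exact configuration.

```python
import random

# Sentinel strings are illustrative placeholders, not the report's exact tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<|fim_begin|>", "<|fim_hole|>", "<|fim_end|>"

def make_fim_example(doc: str, fim_rate: float = 0.1) -> str:
    """Rewrite a document into prefix-suffix-middle (PSM) order.

    The model then learns to generate the middle span conditioned on both
    the prefix and the suffix, using plain next-token prediction.
    """
    if random.random() > fim_rate or len(doc) < 3:
        return doc                                    # most documents stay unchanged
    a, b = sorted(random.sample(range(1, len(doc)), 2))
    prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(make_fim_example("def add(x, y):\n    return x + y\n", fim_rate=1.0))
```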
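The group-relative part of GRPO can be summarized by how advantages are computed: sample a group of responses per prompt, score them, and standardize each reward against the group's mean and standard deviation, removing the need for a separate value model. The clipped-ratio and KL-penalty terms of the full objective are omitted in this sketch.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within each group.

    group_rewards: (groups, samples_per_group) rewards for the responses
    sampled for the same prompt. The full GRPO objective additionally uses
    a PPO-style clipped ratio and a KL penalty, which are omitted here.
    """
    r = np.asarray(group_rewards, dtype=np.float64)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

# one prompt, four sampled responses scored by a rule or reward model
print(grpo_advantages([[0.2, 0.9, 0.4, 0.7]]))
```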
Evaluation and Performance:
- Benchmarks: The report provides evaluations across various benchmarks, including educational benchmarks like MMLU, MMLU-Pro, and GPQA, factuality benchmarks like SimpleQA, code-related benchmarks like HumanEval and LiveCodeBench, and math-related benchmarks like GSM8K, MATH, and MGSM. It also reports results on several Chinese-language and multilingual benchmarks.
- Findings: The evaluations show that DeepSeek-V3 performs well across a broad range of benchmarks, with results that are comparable to or better than other state-of-the-art open-source base models. On math-focused benchmarks such as MATH, it reports state-of-the-art results among open models, and it is competitive with leading closed-source models overall.
Key Technical Details:
- Model Size: 671B total parameters, 37B activated per token.
- Training Data: 14.8T tokens of diverse and high-quality data.
- FP8 Training Implementation: The majority of compute-intensive tensor operations are performed in FP8, while the embedding module, output head, MoE gating, normalization operators, and attention operators run in their original precision. Tile-wise and block-wise scaling is used for quantization.
- Cost of Training: The report states that the full training of DeepSeek-V3 required 2.788M H800 GPU hours, a figure it attributes to the algorithmic and engineering optimizations described above; a rough sanity check of the number follows.
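A quick back-of-the-envelope check of that figure, assuming the full 2048-GPU cluster ran for the whole duration and using the per-GPU-hour rental rate the report itself assumes for its cost estimate:

```python
gpu_hours = 2.788e6          # total H800 GPU hours reported for full training
cluster_gpus = 2048          # GPUs in the training cluster
usd_per_gpu_hour = 2.0       # rental price assumed in the report's cost estimate

wall_clock_days = gpu_hours / cluster_gpus / 24
estimated_cost = gpu_hours * usd_per_gpu_hour

print(f"~{wall_clock_days:.0f} days of wall-clock time")        # ~57 days
print(f"~${estimated_cost / 1e6:.2f}M estimated training cost")  # ~$5.58M
```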
Limitations and Future Directions:
- Deployment: The recommended deployment unit is relatively large, which may be a challenge for smaller teams.
- Scalability: Scaling inference and deployment efficiently remains a limitation; the report acknowledges room for further improvement in generation speed.
- Future Directions: Further exploration and optimization of model architectures; continued iteration on the quantity and quality of training data, including additional data sources; exploration of deeper reasoning and problem-solving capabilities; and development of more comprehensive, multi-dimensional evaluation methods to avoid bias toward a fixed set of benchmarks.
Credibility and Transparency:
- MLA and DeepSeekMoE: The use of Multi-Head Latent Attention (MLA) and DeepSeekMoE are consistent with the authors' prior publications. These architectures are relatively innovative approaches to improving efficiency.
- FP8 Training: The report’s adoption of a custom FP8 framework in training matches current trends in research regarding low-precision training methods. The improvements they claim are somewhat consistent with other papers in the field.
- Multi-Token Prediction (MTP): Using MTP to densify the training signal and potentially enable speculative decoding is somewhat innovative, building on prior multi-token prediction research.
- DualPipe and Low-Precision Training: Their implementation of pipeline parallelism via DualPipe and associated communication optimizations, alongside low-precision training (FP8), are consistent with trends in large-scale model training and optimization research.
- Distillation: Distilling reasoning capability from DeepSeek-R1 is an innovative way to improve instruction following and reasoning, and it is broadly in line with related work on reasoning distillation.
- Performance: The performance claims reported are relatively consistent with expected outcomes in comparison with other major LLMs and appear impressive in the math and code domains.
- Training Cost: The economical training cost is heavily emphasized, indicating that it is an important selling point. The reported GPU-hour breakdown gives readers enough detail to sanity-check the headline figure themselves.
Gap Analysis
Here's where the report leaves some room for more information:
- Dataset Details: While the report states that the model was trained on 14.8T diverse, high-quality tokens, it could say more about the composition of this dataset. Information on data sources, preprocessing methods, or the data mixture (relative proportions of code, natural-language text, math, etc.) would help readers verify the claim.
- More Granular Ablation Studies: Ablations on architectural components, training optimizations, and specific parameters are somewhat limited. While there are ablation studies for load balancing and MTP, there is little discussion of the effect of individual hyper-parameters or of specific parts of the training system on final performance.
- Safety and Bias: The report focuses primarily on performance, efficiency, and technical specifications. It should also address fairness, ethical considerations, and potential biases of the model; even for a base model, it is important to understand how these properties may carry over into downstream applications.
- Quantized Model Checkpoints: While the weights in the GitHub repository are provided in BF16, there are no official details on the performance of quantized versions for the community to test.
- Reproducibility: While the report describes the training process as remarkably stable, it does not provide the exact seeds, software environments, or hardware configurations required to fully reproduce the results.
- Training Details: Given the highly technical nature of the paper, more information about the training implementation would be useful; for example, what exact communication scheme is used for expert dispatch and combine?
- Memory Layout and Overhead: While there is some discussion of memory layout, a more granular breakdown of the memory consumed by each module during training would be useful.
- Detailed Analysis of All Benchmarks: Standard benchmarks are reported, but a more complete accounting of every benchmark evaluated, including those on which the model performs poorly, would be useful.
Unaddressed Questions
- How does the model behave on specific edge cases or adversarial examples?
- How adaptable is the model to new domains or languages beyond what it was trained on?
- What are the computational trade-offs of enabling the MTP prediction at inference for speculative decoding, and how well does this actually improve throughput in practice?
- Are there specific types of tasks where the auxiliary-loss-free balancing method doesn't work as well as auxiliary-loss-based strategies?
- What are the implications for deploying this model on a small number of GPUs/nodes?
- How does the model perform on tasks that specifically require multi-turn reasoning, given that the distillation approach relies on an existing R1 model?
- Given that the model has 671B parameters, how is it being stored, and what are the associated cost tradeoffs?
- Is there any difference in training the model with different optimizers or specific learning rate schemes?