DeepseekV3Config: the DeepSeek-R1-Zero model reuses the configuration class of the DeepseekV3 model, so its parameters are defined by DeepseekV3Config. The tables below group the main parameters; a short instantiation sketch follows the first table.
Basic model parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| vocab_size | 129280 | Vocabulary size |
| hidden_size | 7168 | Transformer hidden layer dimension |
| intermediate_size | 18432 | MLP intermediate layer dimension |
| moe_intermediate_size | 2048 | Intermediate dimension of MoE expert layer |
| num_hidden_layers | 61 | Number of Transformer layers |
| num_attention_heads | 128 | Number of attention heads |
| num_key_value_heads | 128 | Number of key-value heads (GQA configuration) |
| max_position_embeddings | 4096 | Maximum sequence length |
| hidden_act | "silu" | Activation function (SiLU) |
| rms_norm_eps | 1.00E-06 | Epsilon for RMS normalization |
| rope_theta | 10000 | Base frequency of RoPE rotary position embeddings |
| rope_scaling | None | RoPE's scaling strategy (supports linear/dynamic) |
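As a quick way to inspect these defaults, the sketch below instantiates the configuration class directly with only the base parameters from the table above. This is a minimal sketch, assuming a transformers release that ships DeepseekV3Config (otherwise the model's custom configuration code must be loaded with trust_remote_code); everything not passed keeps its default value.

```python
# Minimal sketch: build a DeepseekV3Config with the base parameters listed above.
# Assumes a recent transformers release that includes DeepseekV3Config.
from transformers import DeepseekV3Config

config = DeepseekV3Config(
    vocab_size=129280,
    hidden_size=7168,
    intermediate_size=18432,
    moe_intermediate_size=2048,
    num_hidden_layers=61,
    num_attention_heads=128,
    num_key_value_heads=128,
    max_position_embeddings=4096,
    hidden_act="silu",
    rms_norm_eps=1e-6,
    rope_theta=10000.0,
    rope_scaling=None,
)

print(config.model_type)         # "deepseek_v3"
print(config.hidden_size)        # 7168
print(config.num_hidden_layers)  # 61
```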
MoE (Mixture-of-Experts) parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| n_shared_experts | 1 | Number of shared experts (None for fully dense models) |
| n_routed_experts | 256 | Number of routing experts |
| num_experts_per_tok | 8 | Number of experts selected per token |
| moe_layer_freq | 1 | MoE layer frequency (one MoE layer every moe_layer_freq layers) |
| first_k_dense_replace | 3 | The first k layers use dense MLPs instead of MoE |
| topk_method | "noaux_tc" | Expert selection method |
| n_group | 8 | Number of expert groups |
| topk_group | 4 | Number of expert groups selected per token |
| routed_scaling_factor | 2.5 | Scaling factor for routing experts |
| norm_topk_prob | TRUE | Whether to normalize expert weights |
| scoring_func | "sigmoid" | Expert weight calculation method |
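To make the grouped routing parameters above concrete, here is a simplified sketch of group-limited top-k expert selection: sigmoid scoring, experts split into n_group groups, only the best topk_group groups kept, and num_experts_per_tok experts chosen from the survivors. It is illustrative only, not DeepSeek's exact gating code; in particular it ranks groups by their strongest expert and omits the aux-loss-free bias term implied by topk_method="noaux_tc".

```python
# Simplified sketch of grouped top-k routing (not the official implementation).
import torch


def route_tokens(hidden, gate_weight,
                 n_routed_experts=256, n_group=8, topk_group=4,
                 num_experts_per_tok=8, routed_scaling_factor=2.5,
                 norm_topk_prob=True):
    # hidden: [tokens, hidden_size]; gate_weight: [n_routed_experts, hidden_size]
    scores = torch.sigmoid(hidden @ gate_weight.t())              # [T, 256]

    # Rank groups by their strongest expert and keep the top `topk_group` groups.
    experts_per_group = n_routed_experts // n_group
    group_scores = scores.view(-1, n_group, experts_per_group).max(dim=-1).values
    top_groups = group_scores.topk(topk_group, dim=-1).indices    # [T, 4]

    # Mask out experts belonging to discarded groups.
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, top_groups, 1.0)
    expert_mask = group_mask.repeat_interleave(experts_per_group, dim=1)
    masked_scores = scores.masked_fill(expert_mask == 0, float("-inf"))

    # Select the final experts and their (optionally normalized) weights.
    topk_scores, topk_idx = masked_scores.topk(num_experts_per_tok, dim=-1)
    if norm_topk_prob:
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, topk_scores * routed_scaling_factor


# Example: route 4 random "tokens" with hidden_size=7168.
torch.manual_seed(0)
hidden = torch.randn(4, 7168)
gate_weight = torch.randn(256, 7168) * 0.02   # initializer_range from the tables
idx, weights = route_tokens(hidden, gate_weight)
print(idx.shape, weights.shape)               # torch.Size([4, 8]) torch.Size([4, 8])
```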
Attention parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| num_key_value_heads | 128 | Number of key-value heads (determines GQA/MQA/MHA) |
| attention_bias | FALSE | Whether to use bias in Q/K/V projection |
| attention_dropout | 0 | Attention probability dropout rate |
| q_lora_rank | 1536 | Low-rank (LoRA-style) compression dimension for queries (Q) |
| kv_lora_rank | 512 | Low-rank (LoRA-style) compression dimension for keys/values (K/V) |
| qk_rope_head_dim | 64 | Q/K head dimension using RoPE |
| v_head_dim | 128 | Head dimension of value (V) |
| qk_nope_head_dim | 128 | Q/K head dimension without RoPE |
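The low-rank dimensions above determine how small the per-token attention cache can be. The back-of-the-envelope comparison below is illustrative arithmetic, assuming a DeepSeek-style cache layout that stores one compressed latent of size kv_lora_rank plus one shared RoPE key of size qk_rope_head_dim per token, versus caching full per-head keys and values.

```python
# Back-of-the-envelope per-token attention-cache comparison (illustrative only).
num_attention_heads = 128
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128
kv_lora_rank = 512

# Full per-head cache: each head stores K (RoPE + non-RoPE parts) and V.
full_cache = num_attention_heads * ((qk_nope_head_dim + qk_rope_head_dim) + v_head_dim)

# Compressed cache: one latent vector plus one shared RoPE key per token.
compressed_cache = kv_lora_rank + qk_rope_head_dim

print(f"full cache per token/layer:       {full_cache} values")        # 40960
print(f"compressed cache per token/layer: {compressed_cache} values")  # 576
print(f"reduction:                        {full_cache / compressed_cache:.1f}x")  # ~71.1x
```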
Other parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| initializer_range | 0.02 | Weight Initialization Range |
| use_cache | TRUE | Whether to cache key/value states (used during generation) |
| pad_token_id | None | Padding token ID |
| bos_token_id | 0 | Beginning-of-sequence token ID |
| eos_token_id | 1 | End-of-sequence token ID |
| tie_word_embeddings | FALSE | Whether to bind input/output word embeddings |
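The defaults listed in these tables are what DeepseekV3Config falls back to; the checkpoint published on the Hugging Face Hub overrides several of them (notably the context length, RoPE scaling, and quantization settings). The sketch below loads that published config for inspection; it assumes network access, and on older transformers versions trust_remote_code=True may be required.

```python
# Inspect the configuration shipped with the published checkpoint.
# Requires network access; printed values come from the hub config and may
# differ from the class defaults listed above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Zero")

print(config.model_type)                              # deepseek_v3
print(config.max_position_embeddings)                 # 163840 in the published config
print(config.rope_scaling)                            # YaRN scaling parameters
print(getattr(config, "quantization_config", None))   # FP8 settings, if present
```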
Core Features of DeepSeek-R1-Zero
① Super Long Context Support: the released checkpoint extends the context window to 163,840 tokens via YaRN-optimized RoPE scaling (see the comparison table below).
② Efficient MoE Architecture: 256 routed experts with grouped routing (n_routed_experts=256, n_group=8), with 8 experts activated per token.
③ Low-Rank Adaptation (LoRA): queries and keys/values are projected through low-rank bottlenecks (q_lora_rank=1536, kv_lora_rank=512), shrinking attention parameters and the KV cache.
④ FP8 Quantized Inference: the published weights ship with FP8 dynamic quantization for more efficient inference.
⑤ GQA Attention: num_key_value_heads=128 (equivalent to MHA), retaining full attention capability.
Comparison with DeepSeek-V3
| Feature | DeepSeek-V3 | DeepSeek-R1-Zero |
|---|---|---|
| Max Context Length | 4,096 | 163,840 |
| RoPE Scaling | Basic linear/dynamic | YaRN-optimized |
| MoE Experts | Same configuration | Same configuration |
| LoRA Configuration | Same configuration | Same configuration |
| Quantization Support | None | FP8 Dynamic Quantization |
| Primary Use Case | General pre-training | Long-context optimization |
DeepSeek-R1-Zero builds upon DeepSeek-V3, prioritizing optimizations for long-context processing and inference efficiency (FP8 quantization), making it ideal for scenarios requiring ultra-long text understanding.
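As a rough illustration of where the 163,840-token figure in the comparison table comes from: YaRN rescales the RoPE frequencies so that a model trained with a 4,096-token window can be run at factor × original length. The snippet below follows the general shape of the rope_scaling block used by published DeepSeek-V3/R1 configs; treat the omitted YaRN hyperparameters as details to look up in the actual config rather than values asserted here.

```python
# Illustrative YaRN rope_scaling block; "factor" times the original training
# window gives the extended context length reported in the comparison table.
rope_scaling = {
    "type": "yarn",
    "factor": 40,
    "original_max_position_embeddings": 4096,
    # Remaining YaRN knobs (beta_fast, beta_slow, mscale, ...) are omitted here;
    # see the published config for the exact values.
}

extended_context = rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
print(extended_context)  # 163840
```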