Configuration class used

DeepseekV3Config: the DeepSeek-R1-Zero model continues to use the configuration class of the DeepseekV3 model.
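As a quick way to confirm this, the configuration can be inspected through the transformers library. The snippet below is a minimal sketch; it assumes the Hugging Face model ID deepseek-ai/DeepSeek-R1-Zero and a transformers version that can resolve DeepseekV3Config (older versions may need trust_remote_code=True).

```python
from transformers import AutoConfig

# Minimal sketch: load the published config and inspect a few fields.
# Assumes the model ID "deepseek-ai/DeepSeek-R1-Zero"; older transformers
# versions may require trust_remote_code=True to resolve the config class.
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Zero", trust_remote_code=True)

print(type(config).__name__)        # expected: DeepseekV3Config
print(config.hidden_size, config.num_hidden_layers, config.n_routed_experts)
```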

  1. Core architecture parameters
| Parameter | Default value | Explanation |
| --- | --- | --- |
| vocab_size | 129280 | Vocabulary size |
| hidden_size | 7168 | Transformer hidden dimension |
| intermediate_size | 18432 | MLP intermediate dimension |
| moe_intermediate_size | 2048 | Intermediate dimension of each MoE expert |
| num_hidden_layers | 61 | Number of Transformer layers |
| num_attention_heads | 128 | Number of attention heads |
| num_key_value_heads | 128 | Number of key-value heads (GQA configuration) |
| max_position_embeddings | 4096 | Maximum sequence length |
| hidden_act | "silu" | Activation function (SiLU) |
| rms_norm_eps | 1e-6 | RMSNorm epsilon |
| rope_theta | 10000 | Base frequency for RoPE rotary position embeddings |
| rope_scaling | None | RoPE scaling strategy (supports linear/dynamic) |
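The same core values can also be passed explicitly when building a config object. The sketch below simply mirrors the table above and assumes a transformers release that exports DeepseekV3Config; the released checkpoint's config.json may override some defaults.

```python
from transformers import DeepseekV3Config

# Sketch only: mirror the core architecture defaults from the table above.
config = DeepseekV3Config(
    vocab_size=129280,
    hidden_size=7168,
    intermediate_size=18432,
    moe_intermediate_size=2048,
    num_hidden_layers=61,
    num_attention_heads=128,
    num_key_value_heads=128,
    max_position_embeddings=4096,   # the released checkpoint extends this via RoPE scaling
    hidden_act="silu",
    rms_norm_eps=1e-6,
    rope_theta=10000,
    rope_scaling=None,
)
print(config.hidden_size, config.num_hidden_layers)   # 7168 61
```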
  2. MoE (Mixture of Experts) configuration
| Parameter | Default value | Explanation |
| --- | --- | --- |
| n_shared_experts | 1 | Number of shared experts (None for a fully dense model) |
| n_routed_experts | 256 | Number of routed experts |
| num_experts_per_tok | 8 | Number of experts selected per token |
| moe_layer_freq | 1 | MoE layer frequency (one MoE layer every moe_layer_freq layers) |
| first_k_dense_replace | 3 | The first k layers use dense layers instead of MoE |
| topk_method | "noaux_tc" | Expert selection method |
| n_group | 8 | Number of expert groups |
| topk_group | 4 | Number of expert groups selected per token |
| routed_scaling_factor | 2.5 | Scaling factor for routed expert weights |
| norm_topk_prob | True | Whether to normalize the selected expert weights |
| scoring_func | "sigmoid" | Expert scoring function |
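To make the interaction between these routing parameters concrete, the following is a simplified, illustrative sketch of group-limited top-k routing in PyTorch. It is not DeepSeek's actual implementation; in particular, the random gate weights and the per-group score (here the sum of each group's top-2 expert scores) are assumptions used only to show how n_group, topk_group, num_experts_per_tok, norm_topk_prob and routed_scaling_factor fit together.

```python
import torch

# Simplified, illustrative sketch of group-limited top-k routing (not
# DeepSeek's actual kernel). Gate weights are random; the per-group score
# (sum of each group's top-2 expert scores) is an assumption for illustration.
n_routed_experts, n_group, topk_group = 256, 8, 4
num_experts_per_tok, routed_scaling_factor = 8, 2.5
hidden_size, experts_per_group = 7168, n_routed_experts // n_group

hidden = torch.randn(4, hidden_size)                       # (tokens, hidden_size)
gate = torch.nn.Linear(hidden_size, n_routed_experts, bias=False)

scores = torch.sigmoid(gate(hidden))                       # scoring_func="sigmoid"

# Step 1: score each expert group and keep only the best topk_group groups.
group_scores = scores.view(-1, n_group, experts_per_group)
group_rank = group_scores.topk(2, dim=-1).values.sum(-1)   # (tokens, n_group)
top_groups = group_rank.topk(topk_group, dim=-1).indices
group_mask = torch.zeros_like(group_rank).scatter_(1, top_groups, 1.0).bool()

# Step 2: mask out experts in dropped groups, then take the per-token top-k.
masked = group_scores.masked_fill(~group_mask.unsqueeze(-1), float("-inf"))
masked = masked.reshape(-1, n_routed_experts)
topk_weight, topk_idx = masked.topk(num_experts_per_tok, dim=-1)

# Step 3: normalize the selected weights and apply the routed scaling factor.
topk_weight = topk_weight / topk_weight.sum(-1, keepdim=True)   # norm_topk_prob=True
topk_weight = topk_weight * routed_scaling_factor               # routed_scaling_factor=2.5
print(topk_idx.shape, topk_weight.shape)                        # torch.Size([4, 8]) twice
```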
  3. Attention Mechanism & LoRA Configuration
| Parameter | Default value | Explanation |
| --- | --- | --- |
| num_key_value_heads | 128 | Number of key-value heads (determines GQA/MQA/MHA) |
| attention_bias | False | Whether to use bias in the Q/K/V projections |
| attention_dropout | 0.0 | Dropout rate on attention probabilities |
| q_lora_rank | 1536 | LoRA low-rank dimension for the query (Q) |
| kv_lora_rank | 512 | LoRA low-rank dimension for the key/value (K/V) |
| qk_rope_head_dim | 64 | Q/K head dimension that uses RoPE |
| v_head_dim | 128 | Head dimension of the value (V) |
| qk_nope_head_dim | 128 | Q/K head dimension without RoPE |
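A quick arithmetic check of how these head dimensions combine under the defaults above (illustrative only; it ignores RoPE sharing details and normalization layers):

```python
# Illustrative arithmetic check of the attention/LoRA dimensions above.
hidden_size, num_attention_heads = 7168, 128
qk_rope_head_dim, qk_nope_head_dim, v_head_dim = 64, 128, 128
q_lora_rank, kv_lora_rank = 1536, 512

qk_head_dim = qk_rope_head_dim + qk_nope_head_dim            # 192 per query/key head
assert qk_head_dim == 192
assert num_attention_heads * v_head_dim == 16384             # total value width

# Rough parameter-count comparison for the query path (ignoring RoPE sharing
# and norm layers): a dense projection vs. the low-rank (q_lora_rank) path.
dense_q = hidden_size * num_attention_heads * qk_head_dim
low_rank_q = hidden_size * q_lora_rank + q_lora_rank * num_attention_heads * qk_head_dim
print(dense_q, low_rank_q)   # 176160768 vs 48758784
```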
  4. Other key parameters
| Parameter | Default value | Explanation |
| --- | --- | --- |
| initializer_range | 0.02 | Weight initialization range |
| use_cache | True | Whether to cache key/value states (used during generation) |
| pad_token_id | None | Padding token ID |
| bos_token_id | 0 | Beginning-of-sentence token ID |
| eos_token_id | 1 | End-of-sentence token ID |
| tie_word_embeddings | False | Whether to tie the input/output word embeddings |
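As a small illustration of where the token and caching defaults end up, the snippet below builds a generation config from the values in this table; the IDs are taken from the table, not verified against the released tokenizer files.

```python
from transformers import GenerationConfig

# Sketch: special-token and caching defaults from the table above.
# IDs come from the table, not from the released tokenizer files.
gen_config = GenerationConfig(
    bos_token_id=0,
    eos_token_id=1,
    pad_token_id=None,
    use_cache=True,
)
print(gen_config.use_cache, gen_config.eos_token_id)   # True 1
```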
  5. Core Features of DeepSeek-R1-Zero

    - Super Long Context Support
    - Efficient MoE Architecture
    - Low-Rank Adaptation (LoRA)
    - FP8 Quantized Inference
    - GQA Attention

  6. Comparison with DeepSeek-V3

| Feature | DeepSeek-V3 | DeepSeek-R1-Zero |
| --- | --- | --- |
| Max context length | 4,096 | 163,840 |
| RoPE scaling | Basic linear/dynamic | YaRN-optimized |
| MoE experts | Same configuration | Same configuration |
| LoRA configuration | Same configuration | Same configuration |
| Quantization support | None | FP8 dynamic quantization |
| Primary use case | General pre-training | Long-context optimization |
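The jump from 4,096 to 163,840 tokens corresponds to a 40x RoPE scaling factor (163,840 / 4,096 = 40). As a hedged illustration, a YaRN-style rope_scaling entry in the checkpoint's config.json would look roughly like the following; the exact keys and tuning values are assumptions, not a copy of the released file.

```python
# Illustrative YaRN-style rope_scaling entry (assumed shape, not the exact
# released config.json): 4096 original positions * factor 40 = 163,840.
rope_scaling = {
    "type": "yarn",
    "factor": 40.0,
    "original_max_position_embeddings": 4096,
}
assert int(4096 * rope_scaling["factor"]) == 163_840
```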
  7. Conclusion

DeepSeek-R1-Zero builds upon DeepSeek-V3, prioritizing optimizations for long-context processing and inference efficiency (FP8 quantization), making it ideal for scenarios requiring ultra-long text understanding.