DeepseekV3Config: the DeepSeek-R1-Zero model reuses the configuration class of the DeepseekV3 model, so its parameters are defined by DeepseekV3Config. The tables below group the main parameters; a short instantiation sketch follows the first table.
Basic model parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| vocab_size | 129280 | Vocabulary size |
| hidden_size | 7168 | Transformer hidden layer dimension |
| intermediate_size | 18432 | MLP intermediate layer dimension |
| moe_intermediate_size | 2048 | Intermediate dimension of MoE expert layer |
| num_hidden_layers | 61 | Number of Transformer layers |
| num_attention_heads | 128 | Number of attention heads |
| num_key_value_heads | 128 | Number of key-value heads (GQA configuration) |
| max_position_embeddings | 4096 | Maximum sequence length |
| hidden_act | "silu" | Activation function (SiLU) |
| rms_norm_eps | 1.00E-06 | Epsilon for RMS normalization |
| rope_theta | 10000 | Base frequency of RoPE rotary position embeddings |
| rope_scaling | None | RoPE's scaling strategy (supports linear/dynamic) |
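As a quick way to inspect these defaults, the sketch below instantiates the configuration class directly with only the base parameters from the table above. This is a minimal sketch, assuming a transformers release that ships DeepseekV3Config (otherwise the model's custom configuration code must be loaded with trust_remote_code); everything not passed keeps its default value.

```python
# Minimal sketch: build a DeepseekV3Config with the base parameters listed above.
# Assumes a recent transformers release that includes DeepseekV3Config.
from transformers import DeepseekV3Config

config = DeepseekV3Config(
    vocab_size=129280,
    hidden_size=7168,
    intermediate_size=18432,
    moe_intermediate_size=2048,
    num_hidden_layers=61,
    num_attention_heads=128,
    num_key_value_heads=128,
    max_position_embeddings=4096,
    hidden_act="silu",
    rms_norm_eps=1e-6,
    rope_theta=10000.0,
    rope_scaling=None,
)

print(config.model_type)         # "deepseek_v3"
print(config.hidden_size)        # 7168
print(config.num_hidden_layers)  # 61
```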
MoE (Mixture-of-Experts) parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| n_shared_experts | 1 | Number of shared experts (None for fully dense models) |
| n_routed_experts | 256 | Number of routing experts |
| num_experts_per_tok | 8 | Number of experts selected per token |
| moe_layer_freq | 1 | MoE layer frequency (one MoE layer every moe_layer_freq layers) |
| first_k_dense_replace | 3 | The first k layers use dense MLPs instead of MoE |
| topk_method | "noaux_tc" | Expert selection method |
| n_group | 8 | Number of expert groups |
| topk_group | 4 | Number of expert groups selected per token |
| routed_scaling_factor | 2.5 | Scaling factor for routing experts |
| norm_topk_prob | TRUE | Whether to normalize expert weights |
| scoring_func | "sigmoid" | Expert weight calculation method |
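To make the grouped routing parameters above concrete, here is a simplified sketch of group-limited top-k expert selection: sigmoid scoring, experts split into n_group groups, only the best topk_group groups kept, and num_experts_per_tok experts chosen from the survivors. It is illustrative only, not DeepSeek's exact gating code; in particular it ranks groups by their strongest expert and omits the aux-loss-free bias term implied by topk_method="noaux_tc".

```python
# Simplified sketch of grouped top-k routing (not the official implementation).
import torch


def route_tokens(hidden, gate_weight,
                 n_routed_experts=256, n_group=8, topk_group=4,
                 num_experts_per_tok=8, routed_scaling_factor=2.5,
                 norm_topk_prob=True):
    # hidden: [tokens, hidden_size]; gate_weight: [n_routed_experts, hidden_size]
    scores = torch.sigmoid(hidden @ gate_weight.t())              # [T, 256]

    # Rank groups by their strongest expert and keep the top `topk_group` groups.
    experts_per_group = n_routed_experts // n_group
    group_scores = scores.view(-1, n_group, experts_per_group).max(dim=-1).values
    top_groups = group_scores.topk(topk_group, dim=-1).indices    # [T, 4]

    # Mask out experts belonging to discarded groups.
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, top_groups, 1.0)
    expert_mask = group_mask.repeat_interleave(experts_per_group, dim=1)
    masked_scores = scores.masked_fill(expert_mask == 0, float("-inf"))

    # Select the final experts and their (optionally normalized) weights.
    topk_scores, topk_idx = masked_scores.topk(num_experts_per_tok, dim=-1)
    if norm_topk_prob:
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, topk_scores * routed_scaling_factor


# Example: route 4 random "tokens" with hidden_size=7168.
torch.manual_seed(0)
hidden = torch.randn(4, 7168)
gate_weight = torch.randn(256, 7168) * 0.02   # initializer_range from the tables
idx, weights = route_tokens(hidden, gate_weight)
print(idx.shape, weights.shape)               # torch.Size([4, 8]) torch.Size([4, 8])
```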
Attention parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| num_key_value_heads | 128 | Number of key-value heads (determines GQA/MQA/MHA) |
| attention_bias | FALSE | Whether to use bias in Q/K/V projection |
| attention_dropout | 0 | Attention probability dropout rate |
| q_lora_rank | 1536 | Low-rank (LoRA-style) compression dimension for queries (Q) |
| kv_lora_rank | 512 | Low-rank (LoRA-style) compression dimension for keys/values (K/V) |
| qk_rope_head_dim | 64 | Q/K head dimension using RoPE |
| v_head_dim | 128 | Head dimension of value (V) |
| qk_nope_head_dim | 128 | Q/K head dimension without RoPE |
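The low-rank dimensions above determine how small the per-token attention cache can be. The back-of-the-envelope comparison below is illustrative arithmetic, assuming a DeepSeek-style cache layout that stores one compressed latent of size kv_lora_rank plus one shared RoPE key of size qk_rope_head_dim per token, versus caching full per-head keys and values.

```python
# Back-of-the-envelope per-token attention-cache comparison (illustrative only).
num_attention_heads = 128
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128
kv_lora_rank = 512

# Full per-head cache: each head stores K (RoPE + non-RoPE parts) and V.
full_cache = num_attention_heads * ((qk_nope_head_dim + qk_rope_head_dim) + v_head_dim)

# Compressed cache: one latent vector plus one shared RoPE key per token.
compressed_cache = kv_lora_rank + qk_rope_head_dim

print(f"full cache per token/layer:       {full_cache} values")        # 40960
print(f"compressed cache per token/layer: {compressed_cache} values")  # 576
print(f"reduction:                        {full_cache / compressed_cache:.1f}x")  # ~71.1x
```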
Other parameters:
| Parameter | Default value | Explanation |
|---|---|---|
| initializer_range | 0.02 | Weight Initialization Range |
| use_cache | TRUE | Whether to cache key/value states (used during generation) |
| pad_token_id | None | Padding token ID |
| bos_token_id | 0 | Beginning-of-sequence token ID |
| eos_token_id | 1 | End-of-sequence token ID |
| tie_word_embeddings | FALSE | Whether to bind input/output word embeddings |
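The defaults listed in these tables are what DeepseekV3Config falls back to; the checkpoint published on the Hugging Face Hub overrides several of them (notably the context length, RoPE scaling, and quantization settings). The sketch below loads that published config for inspection; it assumes network access, and on older transformers versions trust_remote_code=True may be required.

```python
# Inspect the configuration shipped with the published checkpoint.
# Requires network access; printed values come from the hub config and may
# differ from the class defaults listed above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Zero")

print(config.model_type)                              # deepseek_v3
print(config.max_position_embeddings)                 # 163840 in the published config
print(config.rope_scaling)                            # YaRN scaling parameters
print(getattr(config, "quantization_config", None))   # FP8 settings, if present
```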
Core Features of DeepSeek-R1-Zero
① Super Long Context Support: the released checkpoint extends the context window to 163,840 tokens via YaRN-optimized RoPE scaling (see the comparison table below).
② Efficient MoE Architecture: 256 routed experts with grouped routing (n_routed_experts=256, n_group=8), with 8 experts activated per token.
③ Low-Rank Adaptation (LoRA): queries and keys/values are projected through low-rank bottlenecks (q_lora_rank=1536, kv_lora_rank=512), shrinking attention parameters and the KV cache.
④ FP8 Quantized Inference: the published weights ship with FP8 dynamic quantization for more efficient inference.
⑤ GQA Attention: num_key_value_heads=128 (equivalent to MHA), retaining full attention capability.
Comparison with DeepSeek-V3
| Feature | DeepSeek-V3 | DeepSeek-R1-Zero |
|---|---|---|
| Max Context Length | 4,096 | 163,840 |
| RoPE Scaling | Basic linear/dynamic | YaRN-optimized |
| MoE Experts | Same configuration | Same configuration |
| LoRA Configuration | Same configuration | Same configuration |
| Quantization Support | None | FP8 Dynamic Quantization |
| Primary Use Case | General pre-training | Long-context optimization |
DeepSeek-R1-Zero builds upon DeepSeek-V3, prioritizing optimizations for long-context processing and inference efficiency (FP8 quantization), making it ideal for scenarios requiring ultra-long text understanding.
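As a rough illustration of where the 163,840-token figure in the comparison table comes from: YaRN rescales the RoPE frequencies so that a model trained with a 4,096-token window can be run at factor × original length. The snippet below follows the general shape of the rope_scaling block used by published DeepSeek-V3/R1 configs; treat the omitted YaRN hyperparameters as details to look up in the actual config rather than values asserted here.

```python
# Illustrative YaRN rope_scaling block; "factor" times the original training
# window gives the extended context length reported in the comparison table.
rope_scaling = {
    "type": "yarn",
    "factor": 40,
    "original_max_position_embeddings": 4096,
    # Remaining YaRN knobs (beta_fast, beta_slow, mscale, ...) are omitted here;
    # see the published config for the exact values.
}

extended_context = rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
print(extended_context)  # 163840
```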