The configuration class is DeepseekV3Config: DeepSeek-R1-Zero continues to use the same configuration as the DeepSeek-V3 model. Basic model parameters:
Parameter | Default value | Explanation |
---|---|---|
vocab_size | 129280 | Vocabulary size |
hidden_size | 7168 | Transformer hidden layer dimension |
intermediate_size | 18432 | MLP intermediate layer dimension |
moe_intermediate_size | 2048 | Intermediate dimension of MoE expert layer |
num_hidden_layers | 61 | Number of Transformer layers |
num_attention_heads | 128 | Number of attention heads |
num_key_value_heads | 128 | Number of key-value heads (GQA configuration) |
max_position_embeddings | 4096 | Maximum sequence length |
hidden_act | "silu" | Activation function (SiLU) |
rms_norm_eps | 1.00E-06 | Epsilon for RMSNorm layers |
rope_theta | 10000 | Base frequency (theta) of the RoPE rotary position embeddings |
rope_scaling | None | RoPE's scaling strategy (supports linear/dynamic) |
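As a quick sanity check, the configuration can be loaded and inspected through the Hugging Face transformers API. A minimal sketch, assuming network access to the deepseek-ai/DeepSeek-R1-Zero repository and a transformers version able to resolve its configuration (hence trust_remote_code=True):

```python
from transformers import AutoConfig

# Sketch: pull the published config and print the basic hyperparameters
# listed in the table above. The repo id and remote-code availability are
# assumptions about the released checkpoint.
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Zero",
    trust_remote_code=True,
)

for name in ("vocab_size", "hidden_size", "intermediate_size",
             "num_hidden_layers", "num_attention_heads",
             "max_position_embeddings", "rope_theta", "rope_scaling"):
    print(name, "=", getattr(config, name, None))
```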
MoE-related parameters:

Parameter | Default value | Explanation |
---|---|---|
n_shared_experts | 1 | Number of shared experts (None for fully dense models) |
n_routed_experts | 256 | Number of routing experts |
num_experts_per_tok | 8 | Number of experts selected per token |
moe_layer_freq | 1 | MoE layer frequency (one MoE layer every moe_layer_freq layers) |
first_k_dense_replace | 3 | The first k layers use dense layers instead of MoE. |
topk_method | "noaux_tc" | Expert selection method |
n_group | 8 | Number of expert groups |
topk_group | 4 | Number of expert groups selected per token |
routed_scaling_factor | 2.5 | Scaling factor for routing experts |
norm_topk_prob | TRUE | Whether to normalize expert weights |
scoring_func | "sigmoid" | Expert weight calculation method |
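To make these routing parameters concrete, below is a simplified PyTorch sketch of grouped top-k expert selection in the spirit of the settings above (sigmoid scoring, 8 groups, top-4 groups per token, 8 experts per token, normalized weights scaled by 2.5). It is an illustration rather than the exact noaux_tc implementation, and it omits any score-correction terms the released code may apply.

```python
import torch

def grouped_topk(router_logits: torch.Tensor,
                 n_group: int = 8, topk_group: int = 4,
                 num_experts_per_tok: int = 8,
                 routed_scaling_factor: float = 2.5,
                 norm_topk_prob: bool = True):
    """Simplified grouped top-k routing sketch (not the exact noaux_tc code)."""
    # scoring_func="sigmoid": per-expert affinity scores in (0, 1)
    scores = torch.sigmoid(router_logits)            # (tokens, n_routed_experts)
    tokens, n_experts = scores.shape

    # Rank expert groups by the sum of their two best expert scores,
    # then keep only the top `topk_group` groups for each token.
    group_scores = scores.view(tokens, n_group, -1).topk(2, dim=-1).values.sum(-1)
    keep_groups = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, keep_groups, 1.0)
    expert_mask = group_mask.repeat_interleave(n_experts // n_group, dim=1)

    # Pick the top experts per token inside the surviving groups.
    masked = scores.masked_fill(expert_mask == 0, 0.0)
    topk_scores, topk_idx = masked.topk(num_experts_per_tok, dim=-1)

    if norm_topk_prob:
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True).clamp(min=1e-20)
    return topk_idx, topk_scores * routed_scaling_factor


logits = torch.randn(4, 256)       # 4 tokens, n_routed_experts=256
idx, weights = grouped_topk(logits)
print(idx.shape, weights.shape)    # torch.Size([4, 8]) torch.Size([4, 8])
```

The group-then-expert selection is what keeps each token's active experts concentrated in a few groups, which in turn limits cross-device communication when experts are sharded by group.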
Attention-related parameters:

Parameter | Default value | Explanation |
---|---|---|
num_key_value_heads | 128 | Number of key-value heads (determines GQA/MQA/MHA behavior) |
attention_bias | FALSE | Whether to use bias in Q/K/V projection |
attention_dropout | 0 | Attention probability dropout rate |
q_lora_rank | 1536 | LoRA low-rank dimension of query (Q) |
kv_lora_rank | 512 | LoRA low-rank dimension of key-value (K/V) |
qk_rope_head_dim | 64 | Q/K head dimension using RoPE |
v_head_dim | 128 | Head dimension of value (V) |
qk_nope_head_dim | 128 | Q/K head dimension without RoPE |
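These low-rank settings are easiest to read as dimension arithmetic. The back-of-the-envelope sketch below (plain Python, no framework) shows how the per-head query/key dimension splits into a RoPE part and a non-RoPE part, and how narrow the low-rank bottlenecks are relative to the 7168-dimensional hidden state; it does not reproduce the exact projection layout of the released model.

```python
hidden_size      = 7168
num_heads        = 128
q_lora_rank      = 1536
kv_lora_rank     = 512
qk_rope_head_dim = 64
qk_nope_head_dim = 128
v_head_dim       = 128

# Each query/key head concatenates a RoPE part and a non-RoPE part.
qk_head_dim = qk_rope_head_dim + qk_nope_head_dim
print("per-head Q/K dim:", qk_head_dim)            # 192
print("total Q dim:", num_heads * qk_head_dim)     # 24576
print("total V dim:", num_heads * v_head_dim)      # 16384

# The low-rank bottlenecks are far narrower than the hidden state,
# which is what makes the compressed Q/KV projections cheap.
print("Q bottleneck ratio:",  q_lora_rank / hidden_size)    # ~0.21
print("KV bottleneck ratio:", kv_lora_rank / hidden_size)   # ~0.07
```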
Other general parameters:

Parameter | Default value | Explanation |
---|---|---|
initializer_range | 0.02 | Standard deviation for weight initialization |
use_cache | TRUE | Whether to cache key/value states (for generation) |
pad_token_id | None | Padding token ID |
bos_token_id | 0 | Beginning-of-sentence token ID |
eos_token_id | 1 | End-of-sentence token ID |
tie_word_embeddings | FALSE | Whether to tie the input and output word embeddings |
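Since these general parameters are ordinary keyword arguments of the config class, a scaled-down config can also be built directly for quick experiments. A rough sketch, assuming a transformers version that exposes DeepseekV3Config at the top level (older versions would need the remote-code path instead); the small sizes are arbitrary test values, not anything from the released model:

```python
from transformers import DeepseekV3Config  # assumption: recent transformers ships this class

# Tiny, made-up sizes for a smoke test; only the generation-related
# fields mirror the defaults in the table above.
tiny_cfg = DeepseekV3Config(
    vocab_size=1024,
    hidden_size=64,
    intermediate_size=128,
    moe_intermediate_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    n_routed_experts=8,
    n_group=2,
    topk_group=1,
    num_experts_per_tok=2,
    use_cache=True,
    tie_word_embeddings=False,
    pad_token_id=None,
    bos_token_id=0,
    eos_token_id=1,
)
print(tiny_cfg.use_cache, tiny_cfg.eos_token_id)
```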
Core Features of DeepSeek-R1-Zero
① Super Long Context Support: context length extended to 163,840 tokens via YaRN-optimized RoPE scaling (see the comparison below).
② Efficient MoE Architecture: routed experts are organized into 8 groups (n_group=8), with 8 experts activated per token.
③ Low-Rank Adaptation (LoRA): low-rank query and key-value projections (q_lora_rank=1536, kv_lora_rank=512).
④ FP8 Quantized Inference: supports FP8 dynamic quantization for efficient inference.
⑤ GQA Attention: num_key_value_heads=128 (equivalent to MHA), retaining full attention capability.

Comparison with DeepSeek-V3
Feature | DeepSeek-V3 | DeepSeek-R1-Zero |
---|---|---|
Max Context Length | 4,096 | 163,840 |
RoPE Scaling | Basic linear/dynamic | YaRN-optimized |
MoE Experts | Same configuration | Same configuration |
LoRA Configuration | Same configuration | Same configuration |
Quantization Support | None | FP8 Dynamic Quantization |
Primary Use Case | General pre-training | Long-context optimization |
DeepSeek-R1-Zero builds upon DeepSeek-V3, prioritizing optimizations for long-context processing and inference efficiency (FP8 quantization), making it ideal for scenarios requiring ultra-long text understanding.
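The long-context side of that comparison comes down to the rope_scaling field. As an illustration only (the field names follow the common Hugging Face YaRN convention, and the exact values shipped with the checkpoint may differ), a YaRN-style override stretching the 4,096-token base window to 163,840 positions would look roughly like this:

```python
# Illustrative YaRN-style RoPE scaling: 4096 * 40 = 163,840 positions.
# Field names and values are assumptions, not copied from the released config.
yarn_rope_scaling = {
    "type": "yarn",
    "factor": 40.0,
    "original_max_position_embeddings": 4096,
}

config.rope_scaling = yarn_rope_scaling        # `config` from the earlier AutoConfig sketch
config.max_position_embeddings = 4096 * 40     # 163,840
```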