DeepSeek-V3 Technical Report
DeepSeek-AI
research@
Abstract
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
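The auxiliary-loss-free strategy named above balances expert load without an auxiliary loss term: a per-expert bias is added to the routing scores only when selecting which experts receive a token, while the gating weights still come from the raw affinities, and the bias is nudged toward balance based on observed load. The following is a minimal NumPy sketch of that general idea, not the report's implementation; the function names, the sign-based update, and the step size `gamma` are illustrative assumptions.

```python
import numpy as np

def route_tokens(affinity, bias, k):
    """Pick top-k experts per token from bias-adjusted scores.

    The bias influences which experts are chosen, but the gating
    weights come from the raw affinities of the chosen experts.
    """
    adjusted = affinity + bias                          # (n_tokens, n_experts)
    topk = np.argsort(-adjusted, axis=1)[:, :k]         # chosen expert ids
    rows = np.arange(affinity.shape[0])[:, None]
    gates = affinity[rows, topk]
    gates = gates / gates.sum(axis=1, keepdims=True)    # normalize per token
    return topk, gates

def update_bias(bias, topk, n_experts, gamma=1e-3):
    """Lower the bias of overloaded experts and raise it for
    underloaded ones, steering future routing toward balance."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

# Toy usage: 8 experts, route 4 tokens to 2 experts each.
rng = np.random.default_rng(0)
affinity = 1.0 / (1.0 + np.exp(-rng.normal(size=(4, 8))))  # sigmoid affinities
bias = np.zeros(8)
topk, gates = route_tokens(affinity, bias, k=2)
bias = update_bias(bias, topk, n_experts=8)
```

Because the bias enters only the selection step, it redistributes tokens across experts without distorting the gradient signal the way an auxiliary balance loss would.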
[Figure 1: Benchmark performance (0-100 scale) of DeepSeek-V3 versus DeepSeek-V2.5, Qwen2.5-72B-Inst, Llama-3.1-405B-Inst, GPT-4o-0513, and Claude-3.5-Sonnet-1022.]