arXiv:2501.12599v1 [cs.AI] 22 Jan 2025
KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS
TECHNICAL REPORT OF KIMI K1.5
Kimi Team
ABSTRACT
Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities, e.g., 77.5 on AIME, 96.2 on MATH500, 94th percentile on Codeforces, and 74.9 on MathVista, matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results, e.g., 60.8 on AIME, 94.6 on MATH500, and 47.3 on LiveCodeBench, outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).
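To make the "no value function, no process reward model" claim concrete, the following is a minimal illustrative sketch, not the authors' released algorithm: a REINFORCE-style policy-gradient update that samples several chain-of-thought completions per prompt, scores each with a verifiable final-answer reward, and uses the group-mean reward as the baseline. The names sample_cot, logprob_of, and answer_is_correct are hypothetical placeholders standing in for a real sampler, policy log-probability, and rule-based verifier.

# Illustrative sketch only (assumed, not the paper's implementation): a
# value-function-free policy-gradient update over sampled long chains of thought.
import random
from typing import Callable, List

def group_baseline_advantages(rewards: List[float]) -> List[float]:
    """Use the mean reward of the sampled group as the baseline (no learned value network)."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def policy_gradient_loss(
    prompt: str,
    sample_cot: Callable[[str], str],          # hypothetical: draws one CoT completion from the policy
    logprob_of: Callable[[str, str], float],   # hypothetical: log p(completion | prompt) under the policy
    answer_is_correct: Callable[[str], bool],  # hypothetical: rule-based verifier of the final answer
    num_samples: int = 8,
) -> float:
    """REINFORCE-style surrogate loss over a group of sampled chains of thought."""
    completions = [sample_cot(prompt) for _ in range(num_samples)]
    rewards = [1.0 if answer_is_correct(c) else 0.0 for c in completions]
    advantages = group_baseline_advantages(rewards)
    # Minimizing this surrogate increases the log-probability of above-average completions.
    return -sum(a * logprob_of(prompt, c) for a, c in zip(advantages, completions)) / num_samples

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    random.seed(0)
    loss = policy_gradient_loss(
        prompt="What is 2 + 2?",
        sample_cot=lambda p: random.choice(["2 + 2 = 4", "2 + 2 = 5"]),
        logprob_of=lambda p, c: -0.1 * len(c),
        answer_is_correct=lambda c: c.endswith("= 4"),
        num_samples=4,
    )
    print(f"surrogate loss: {loss:.3f}")

The design point the sketch illustrates is that with a verifiable outcome reward and a group-relative baseline, the update needs neither a critic network nor per-step process rewards.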
Figure 1: Kimi k1.5 long-CoT results compared with OpenAI o1, OpenAI o1-mini, QVQ-72B-Preview, and QwQ-32B Preview across Math, Code, and Vision benchmarks: AIME 2024 (Pass@1), MATH500 (EM), Codeforces (Percentile), LiveCodeBench v5 24.12-25.2 (Pass@1), MathVista (Pass@1), and MMMU (Pass@1). [Bar chart omitted.]