

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team

https://huggingface.co/openbmb/MiniCPM-SALA

/OpenBMB/MiniCPM

Abstract

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks.
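
As a rough illustration of the 1:3 interleaving described above, the sketch below lays out a decoder stack in which every fourth layer uses sparse attention and the remaining layers use linear attention. This is a minimal sketch under stated assumptions: the 32-layer depth, the fixed periodic rule, and the function name select_layer_types are all illustrative, and the paper's actual layer selection algorithm presumably picks the sparse layers by some measured criterion rather than a fixed modulo pattern.

```python
# Hypothetical 1:3 sparse/linear layer interleave (not the paper's algorithm).
# Assumptions: 32 decoder layers; every 4th layer is sparse (InfLLM-V2-style),
# and the other three in each group of four are linear (Lightning-Attention-style).

SPARSE, LINEAR = "sparse", "linear"

def select_layer_types(num_layers: int = 32, period: int = 4) -> list[str]:
    """Assign an attention type to each layer at a 1:(period - 1) sparse:linear ratio."""
    return [SPARSE if i % period == 0 else LINEAR for i in range(num_layers)]

if __name__ == "__main__":
    layout = select_layer_types()
    print(layout[:8])  # ['sparse', 'linear', 'linear', 'linear', 'sparse', ...]
    print(layout.count(SPARSE), ":", layout.count(LINEAR))  # 8 : 24, i.e. 1:3
```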

Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, reducing training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves inference speeds up to 3.5× faster than full-attention models at a sequence length of 256K tokens and supports
