Bridging the Digital to Physical Divide
Evaluating LLM Agents on Benchtop DNA Acquisition
Summary
There have been striking developments over the past several years in the performance of artificial intelligence (AI), and of large language models (LLMs) in particular. This has driven a need for more robust evaluations of model capabilities, particularly those relevant to security-critical domains. One important class of capability evaluations examines LLM agents that interact with their environment, testing their ability to apply information situationally and act autonomously. In this report, we describe the design and execution of an agent evaluation intended to probe capabilities in applied molecular biology.
Benchtop DNA synthesis is relevant to threat models in which a malicious actor attempts to create a viral pathogen.¹ We focus our evaluation on this task, presenting our agents with a simulated interface to a benchtop DNA synthesizer. We then prompt the agents to assist a hypothetical non-expert user in creating one of two DNA sequences of interest: an enhanced green fluorescent protein (eGFP) and a sequence encoding an influenza hemagglutinin (HA) protein. Agents are also asked to generate a protocol for handling the synthesized DNA in a laboratory setting.
We originally tested a set of five models that were at or near the frontier of LLM capability in mid-2025: GPT-4.1 and o3 from OpenAI, Claude Sonnet 4 and Opus 4 from Anthropic, and Gemini 2.5 Pro from Google. Immediately after its release in August 2025, we added GPT-5 to our model set, and in January 2026 we also tested Claude Opus 4.5 and Gemini 3 Pro. All LLMs were tested within ReAct agent scaffolds, and the core evaluation was implemented in the Inspect AI library, a framework for LLM evaluations developed by the UK AI Security Institute.²
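The summary does not describe the internals of the ReAct agent scaffolds. As a rough, generic illustration of what a ReAct ("reason + act") scaffold does, the sketch below alternates model "thoughts" with tool calls and feeds each tool result back to the model as an observation until it returns a final answer or exhausts a step budget. Everything here is hypothetical: `Step`, `react_loop`, the scripted stand-in model, and the `lookup` tool are illustrative names only. This is not Inspect AI's API or the authors' scaffold, and a real scaffold would call an LLM and parse its output rather than use a scripted function.

```python
"""Minimal ReAct-style agent loop (hypothetical sketch, not the report's scaffold)."""
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One model turn: a thought plus either a tool call or a final answer."""
    thought: str
    action: Optional[str] = None       # name of the tool to call, if any
    action_input: str = ""             # argument passed to that tool
    final_answer: Optional[str] = None # set when the model is done


def react_loop(
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate reasoning and tool use until a final answer or the step limit."""
    transcript: List[str] = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model(transcript)
        transcript.append(f"Thought: {step.thought}")
        if step.final_answer is not None:
            return step.final_answer
        if step.action not in tools:
            # Report the bad tool name back to the model and keep going.
            transcript.append(f"Observation: unknown tool {step.action!r}")
            continue
        observation = tools[step.action](step.action_input)
        transcript.append(f"Action: {step.action}({step.action_input})")
        transcript.append(f"Observation: {observation}")
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: List[str]) -> Step:
    """Hypothetical stand-in for an LLM: call one tool, then answer from its result."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return Step(thought="I should look this up.", action="lookup",
                    action_input="capital of France")
    return Step(thought="I have what I need.",
                final_answer=observations[-1][len(prefix):])


# A single toy tool backed by a dictionary.
tools = {"lookup": lambda q: {"capital of France": "Paris"}.get(q, "not found")}

answer = react_loop(scripted_model, tools, "What is the capital of France?")
print(answer)  # Paris
```

The step limit matters in agent evaluations: without it, a model that keeps issuing invalid or unproductive tool calls would loop indefinitely, so the scaffold returns `None` to mark a run that never reached an answer.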
The key findings from our testing include:
• The most recent models we tested (GPT-5, Opus 4.5, and Gemini 3 Pro) reliably design eGFP DNA segments that s…