About 63,800 characters
About 56 pages
Published 2018-05-29 in Shanghai
Continuous-Time Unified MAXQ Algorithm and Its Application Analysis

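Research on the Continuous-Time Unified MAXQ Algorithm and Its Application

Abstract (translated from the Chinese)

Hierarchical reinforcement learning methods with abstraction mechanisms can reduce the dimensionality of the state space and thereby address the "curse of dimensionality" in large-scale systems. Because they introduce state abstraction, these methods can speed up policy learning and reduce the storage needed for state-action pairs during learning. Typical hierarchical reinforcement learning algorithms include the Option, HAM, and MAXQ algorithms. However, most traditional hierarchical reinforcement learning algorithms are built on a discrete-time semi-Markov model or a discrete-time multi-Agent semi-Markov model. They cannot handle single-Agent or multi-Agent learning systems in continuous time, and each algorithm applies to only one of the average criterion or the discounted criterion.

Within the framework of performance potential theory, this thesis combines the ideas of the existing MAXQ algorithm with the continuous-time SMDP model and proposes a continuous-time unified MAXQ algorithm that applies to both the average and the discounted performance criteria. Because the Web service composition problem can be modeled as a semi-Markov decision process, the proposed algorithm is applied to Web service composition to verify that it has practical significance. In addition, a travel reservation system is used as a simulation example to show that, compared with Q-learning, the algorithm offers higher optimization accuracy, faster optimization, and lower storage requirements.

However, because the capability of a single Agent is limited, more and more complex problems must be solved through the cooperation of multiple Agents. This thesis therefore combines performance potential theory with the continuous-time unified MAXQ algorithm constructed earlier and further proposes a multi-Agent continuous-time unified MAXQ algorithm that also applies to both the average and the discounted performance criteria. The algorithm is then applied to the multi-Agent continuous-time Web service composition problem. Finally, a travel reservation system is again used as a simulation example to show that the algorithm achieves better optimization results than both the single-Agent MAXQ and the selfish multi-Agent MAXQ algorithms, while also speeding up learning and saving storage.

Keywords: semi-Markov decision process (SMDP); multi-Agent semi-Markov decision process (MSMDP); performance potential; MAXQ algorithm; Web service composition

Continuous-Time Unified MAXQ Algorithm and Its Application

ABSTRACT

Hierarchical reinforcement learning with an abstraction mechanism can reduce the dimension of the state space and so address the "curse of dimensionality" that arises in large-scale systems. Owing to the abstraction mechanism, hierarchical reinforcement learning can accelerate policy learning and save the memory used for state-action pairs. There are three typical hierarchical reinforcement learning algorithms: Option, HAM and MAXQ. However, traditional hierarchical reinforcement learning algorithms are mostly based on the discrete-time SMDP model or the discrete-time multi-agent SMDP model. They cannot solve continuous-time single-agent or multi-agent learning problems, and each applies only to either the average criterion or the discounted criterion.

In this dissertation, under the framework of performance potential theory, we combine the existing MAXQ algorithm with the continuous-time SMDP model and introduce a continuous-time unified MAXQ algorithm that applies under both average- and discounted-cost criteria. Because the web service composition problem can be modeled as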