Proximal Policy Optimization with Double Clipping Boundaries

doi:10.15888/j.cnki.csa.009033


Computer Systems & Applications, 2023, Vol. 32, No. 4, pp. 177-186

Authors:

ZHANG Jun (张骏)
School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China; School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China

WANG Hong-Cheng (王红成)
School of Electrical Engineering and Intelligentization, Dongguan University of Technology, Dongguan 523808, China

Fund project: Key Scientific Research Platforms and Projects of Ordinary Universities in Guangdong Province (2020ZDZX3075)


Abstract:

Proximal policy optimization (PPO) is a stable deep reinforcement learning algorithm. One of its key ingredients is a clipped surrogate objective that limits the update step size. Experiments show that, with the empirically optimal clipping coefficient, no upper bound on the Kullback-Leibler (KL) divergence can be established, which contradicts trust-region optimization theory. This study proposes an improved algorithm, PPO with double clipping boundaries (PPO-DC). The algorithm adjusts the KL divergence through probability-based two-segment clipping boundaries and restricts the parameters to the trust region, so that the sample data are fully utilized. On several continuous control tasks, PPO-DC achieves better performance than other algorithms.

Key words: reinforcement learning; policy gradient (PG); proximal policy optimization (PPO); clipping mechanism
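To make the abstract's central observation concrete, the sketch below implements the standard single-sample PPO clipped surrogate and checks it against the KL divergence for a toy one-dimensional Gaussian policy. It is only an illustration under stated assumptions: the function names, the unit-variance Gaussian policy, and the value epsilon = 0.2 are hypothetical choices, not taken from the paper, and the double clipping boundaries of PPO-DC are not reproduced because the abstract does not specify them. The point is that the probability ratio at a sampled action can sit inside the clipping range while the KL divergence between the old and new policies is large.

```python
import math


def gaussian_pdf(x, mu, sigma=1.0):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def probability_ratio(action, mu_old, mu_new, sigma=1.0):
    """Ratio pi_new(a) / pi_old(a) for Gaussian policies with a shared sigma."""
    return gaussian_pdf(action, mu_new, sigma) / gaussian_pdf(action, mu_old, sigma)


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Standard PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def gaussian_kl(mu_old, mu_new, sigma=1.0):
    """KL(pi_old || pi_new) for two Gaussians with the same sigma."""
    return (mu_old - mu_new) ** 2 / (2.0 * sigma ** 2)


# Toy case (assumed values): the new mean has moved far from the old mean,
# yet the sampled action lies midway between them, so the ratio equals 1.
mu_old, mu_new, action = 0.0, 4.0, 2.0
ratio = probability_ratio(action, mu_old, mu_new)
kl = gaussian_kl(mu_old, mu_new)
objective = clipped_surrogate(ratio, advantage=1.0)

print(f"ratio={ratio:.3f}  clipped objective={objective:.3f}  KL={kl:.3f}")
```

In this toy case the clip never activates for the sampled action even though the two policies are far apart, which is the kind of gap between ratio clipping and a KL trust region that the abstract describes.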

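Cite this article

ZHANG Jun, WANG Hong-Cheng. Proximal Policy Optimization with Double Clipping Boundaries. Computer Systems & Applications, 2023, 32(4): 177-186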


History

Received: 2022-08-23
Revised: 2022-09-27
Published online: 2022-12-23
