Offline to Online Reinforcement Learning Combining Dynamic Replay Buffer and Time Decaying Constraint

Funding: National Natural Science Foundation of China (62172292, 42375147)

Authors:
  • YAN Lei-Ming
    Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China; School of Computer Science and Cyber Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China
  • ZHU Yong-Xin
    Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China; School of Computer Science and Cyber Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China
  • LIU Jian
    Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044, China; School of Computer Science and Cyber Science and Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, China

    Abstract:

    In offline-to-online reinforcement learning, although the agent can leverage pre-collected offline data for initial policy learning, the online fine-tuning phase often exhibits instability in its early stage, and the performance gain after fine-tuning is relatively small. To address this issue, two key designs are proposed: 1) a simulated annealing-based dynamic offline-online replay buffer and 2) a simulated annealing-based decay of the behavior constraint. The first design uses the simulated annealing idea to dynamically choose between offline data and online interaction experiences during training, yielding an optimized update strategy that balances the stability of online training against fine-tuning performance. The second design introduces a behavior cloning constraint with a cooling mechanism to mitigate the sharp performance drop caused by updating with online experiences early in fine-tuning, and gradually relaxes the constraint in later stages to promote performance improvement. Experimental results show that the proposed algorithm, offline-to-online reinforcement learning with a dynamic replay buffer and time decaying constraints (DRB-TDC), improves performance by 45%, 65%, and 21% on the HalfCheetah, Hopper, and Walker2d tasks from the MuJoCo benchmark after online fine-tuning, respectively, and its average normalized score across all tasks exceeds that of the best baseline algorithm by 10%.
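To make the first design concrete, the following is a minimal Python sketch (not the authors' implementation) of annealed selection between the two buffers: each gradient step draws its batch from either the offline or the online replay buffer, with the online probability rising as a temperature cools. The geometric schedule, the parameter names (initial_temp, final_temp, batch_size), and the buffer objects with a sample() method are all assumptions for illustration.

import random

def online_sample_probability(step, total_steps, initial_temp=1.0, final_temp=0.01):
    """Probability of drawing the next update batch from the online buffer.

    Early in fine-tuning the temperature is high and roughly half of the batches
    still come from the offline buffer (stability); as the temperature drops,
    online experience dominates (performance).
    """
    frac = min(step / max(total_steps, 1), 1.0)
    temp = initial_temp * (final_temp / initial_temp) ** frac  # geometric cooling
    return 1.0 / (1.0 + temp)  # low temperature -> probability close to 1

def sample_batch(offline_buffer, online_buffer, step, total_steps, batch_size=256):
    """Pick which buffer feeds this gradient step, then sample a batch from it."""
    p_online = online_sample_probability(step, total_steps)
    buffer = online_buffer if random.random() < p_online else offline_buffer
    return buffer.sample(batch_size)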

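Similarly, the time-decaying behavior constraint of the second design can be sketched as an actor loss whose behavior cloning coefficient is annealed toward zero over fine-tuning. The sketch borrows the Q-normalization trick from TD3+BC for scale; the linear schedule, the initial coefficient of 2.5, and the actor/critic/batch interfaces are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def bc_weight(step, total_steps, initial_weight=2.5, final_weight=0.0):
    """Linearly anneal the behavior cloning coefficient toward zero."""
    frac = min(step / max(total_steps, 1), 1.0)
    return initial_weight + (final_weight - initial_weight) * frac

def actor_loss(actor, critic, batch, step, total_steps):
    """Maximize Q while staying close to the batch actions early in fine-tuning;
    the behavior constraint is gradually relaxed as training proceeds."""
    states, actions = batch["states"], batch["actions"]
    pi = actor(states)
    q = critic(states, pi)
    # Normalize the Q term so the annealed coefficient keeps a consistent scale
    # (the same trick used by TD3+BC).
    q_term = -q.mean() / q.abs().mean().detach()
    bc_term = F.mse_loss(pi, actions)
    return q_term + bc_weight(step, total_steps) * bc_term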
Cite this article

YAN Lei-Ming, ZHU Yong-Xin, LIU Jian. Offline to online reinforcement learning combining dynamic replay buffer and time decaying constraint. Computer Systems & Applications, (): 1-10.
History
  • Received: 2024-10-22
  • Last revised: 2024-11-19
  • Published online: 2025-03-24