Caffe Inference Acceleration Method on Heterogeneous Parallel Platform

Abstract (translated from the Chinese):

As computer hardware performance improves, inference with pre-trained machine learning models is beginning to appear on personal devices. Caffe is a popular deep learning framework that excels at tasks such as image classification, but by default it runs on only a single core and cannot fully exploit the computing power of heterogeneous parallel devices. Deep learning places high demands on computing performance; parallelizing inference to use all available computing devices would improve both speed and user experience. Because the CPU-to-GPU performance ratio differs across models, tasks cannot simply be divided evenly among the devices. Moreover, scheduling algorithms that split tasks into too many pieces, or that must wait for all devices to finish before synchronizing, introduce extra overhead. A well-designed scheduling algorithm that reduces device idle time is therefore needed to achieve better performance. Some methods for improving Caffe's parallel performance already exist, but they are restricted to specific platforms and difficult to use, so they cannot easily exploit the full computing power of heterogeneous parallel devices. This paper extends the Caffe interface so that custom programs can use multiple cores or computing devices of a heterogeneous parallel platform for deep learning inference with Caffe. Several existing scheduling algorithms are then ported to this task and evaluated. To reduce the synchronization overhead of existing algorithms, two new algorithms are proposed: first-in-first-out (FIFO) scheduling and fast-split scheduling. Tests show that with the fast-split algorithm on heterogeneous parallel devices, Caffe inference is substantially faster than with a single CPU core or a single GPU. Moreover, compared with HAT, the best-performing existing scheduling algorithm, the proposed fast-split algorithm reduces wasted computing performance by 7.4% on MNIST and 21.0% on Cifar-10.

    Abstract:

    As computer hardware performance improves, pre-trained machine learning models are increasingly used for inference on personal devices. Caffe is a popular deep learning framework that excels at tasks such as image classification. Without customization, however, it can run inference on only one CPU core or one GPU, which leaves the computing power of heterogeneous parallel devices underused. Deep learning is computationally demanding; for faster inference and a better user experience, it is important to fully use all computing cores of a device through parallelization. Since the CPU-to-GPU performance ratio varies across deep learning models, tasks should not simply be assigned equally to all computing cores. Additional overhead is introduced if tasks are divided into too many portions or if synchronized scheduling algorithms are used. A well-designed scheduling algorithm that reduces idle time is therefore crucial for good performance. Some approaches have been developed to improve Caffe performance on heterogeneous parallel devices, but they impose restrictions on platform hardware and are difficult to use, making it hard to fully utilize these devices. This study extends the Caffe interface so that customized programs can use multiple computing cores or devices of a heterogeneous parallel platform for deep learning inference with Caffe. Several existing scheduling algorithms are ported and tested. To avoid synchronization overhead, two novel asynchronous scheduling algorithms, async-FIFO and fast-split, are proposed. Test results show that Caffe inference with fast-split on heterogeneous parallel devices is significantly faster than inference on a single CPU core or a single GPU. Compared with HAT, the best existing heterogeneous parallel scheduling algorithm, fast-split reduces performance waste by 7.4% on MNIST and 21.0% on Cifar-10.
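The abstract's core scheduling argument can be sketched with a small simulation. This is an illustrative example, not the paper's implementation: the device speeds and batch count are hypothetical, and real schedulers must also account for per-dispatch overhead. It shows why a shared FIFO queue, where an idle device immediately pulls the next batch, beats a static equal split when device speeds differ.

```python
def static_split_makespan(batches, speeds):
    """Equal static split: each device gets batches/len(speeds) up front.
    Makespan is the slowest device's finish time (time per batch = 1/speed)."""
    share = batches // len(speeds)
    return max(share / s for s in speeds)


def fifo_makespan(batches, speeds):
    """Dynamic FIFO queue: whichever device goes idle first pulls the
    next batch, so faster devices naturally take on more work."""
    clocks = [0.0] * len(speeds)  # per-device simulated finish times
    for _ in range(batches):
        i = min(range(len(speeds)), key=lambda k: clocks[k])  # next idle device
        clocks[i] += 1.0 / speeds[i]
    return max(clocks)


if __name__ == "__main__":
    speeds = [1.0, 4.0]   # hypothetical CPU-core vs GPU throughput ratio
    batches = 100
    print(static_split_makespan(batches, speeds))  # 50.0: CPU is the bottleneck
    print(fifo_makespan(batches, speeds))          # 20.0: matches ideal 100/(1+4)
```

With the total throughput of 5 batches per time unit, the ideal makespan for 100 batches is 20.0; the FIFO queue reaches it here, while the equal split leaves the GPU idle for 75% of the run. This idle time is the "performance waste" the abstract measures.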

Cite this article:

Wang ZX, Shao PN, Deng C. Caffe inference acceleration method on heterogeneous parallel platform. Computer Systems & Applications, 2022, 31(2): 220-226. (in Chinese)

History
  • Received: 2021-04-14
  • Revised: 2021-05-11
  • Published online: 2022-01-28