Computer Systems & Applications, 2022, 31(2): 220-226
Caffe Inference Acceleration Method on Heterogeneous Parallel Platform
(The 32nd Research Institute of China Electronics Technology Group Corporation, Shanghai 201808, China)
Abstract
Received:April 14, 2021    Revised:May 11, 2021
Abstract: With the improvement of computer performance, pre-trained machine learning models are now used for inference on personal devices. Caffe is a popular deep learning framework well suited to tasks such as image classification, but without customization it can only run inference on a single CPU core or a single GPU, which leaves the computing power of heterogeneous parallel devices underused. Deep learning inference is computationally demanding; for a better user experience and faster inference, it is important to use all computing cores of a device through parallelization. Because the CPU-to-GPU performance ratio varies across deep learning models, tasks should not simply be divided equally among the computing devices. Moreover, extra overhead is introduced if tasks are split into too many portions or if the scheduler must wait for all devices to finish before synchronizing. A well-designed scheduling algorithm that reduces device idle time is therefore crucial for good performance. Some approaches to improving Caffe's parallel performance already exist, but they are restricted to specific platforms and are difficult to use, so they cannot easily exploit the full computing power of heterogeneous parallel devices. This study extends the Caffe interface so that custom programs can use multiple computing cores or devices of a heterogeneous parallel platform for deep learning inference with Caffe. Several existing scheduling algorithms are then ported to this setting and evaluated. To reduce the synchronization overhead of the existing algorithms, two new asynchronous scheduling algorithms, async-FIFO and fast-split, are proposed. Tests show that with fast-split scheduling on heterogeneous parallel devices, Caffe's inference speed is substantially higher than with only a single CPU core or a single GPU. Compared with HAT, the best-performing existing heterogeneous parallel scheduling algorithm, fast-split reduces wasted computing performance by 7.4% and 21.0% on the MNIST and Cifar-10 datasets, respectively.
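The two scheduling ideas summarized in the abstract can be illustrated with a minimal sketch: split a workload among devices in proportion to their measured throughput (so a faster GPU gets a larger share than a CPU core), and let idle workers pull batches from a shared first-in-first-out queue so no global synchronization barrier is needed between rounds. This is an illustrative sketch only, not the paper's implementation; the function names (`proportional_split`, `run_fifo`), the throughput figures, and the placeholder `infer` callback are all hypothetical, and a real deployment would call Caffe's inference API inside each worker.

```python
import queue
import threading

def proportional_split(n_tasks, throughputs):
    """Assign n_tasks among devices in proportion to measured throughput.

    throughputs: dict mapping device name -> items/second, assumed to come
    from a short calibration run on the actual model (the ratio is
    model-dependent, as the abstract notes).
    """
    total = sum(throughputs.values())
    shares = {dev: int(n_tasks * t / total) for dev, t in throughputs.items()}
    # Hand any rounding remainder to the fastest device.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += n_tasks - sum(shares.values())
    return shares

def run_fifo(batches, devices, infer):
    """Asynchronous FIFO scheduling sketch: one worker thread per device
    pulls the next batch as soon as it is idle, so slow devices never
    force fast ones to wait at a synchronization point."""
    q = queue.Queue()
    for b in batches:
        q.put(b)
    results, lock = [], threading.Lock()

    def worker(dev):
        while True:
            try:
                batch = q.get_nowait()
            except queue.Empty:
                return  # no work left for this device
            out = infer(dev, batch)  # placeholder for a Caffe forward pass
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The design point the abstract makes is visible here: with a shared queue, the only coordination cost is per-batch queue access, whereas a scheduler that waits for every device to finish its slice before dispatching the next round pays a barrier cost each round.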
Citation:
WANG Zi-Xi, SHAO Pei-Nan, DENG Chang. Caffe Inference Acceleration Method on Heterogeneous Parallel Platform. Computer Systems & Applications, 2022, 31(2): 220-226.