计算机系统应用 (Computer Systems & Applications), 2019, Vol. 28, Issue (9): 88-94

Optimization of CNN Computing Task Partition Based on Many-Core BWDSP
WANG Gai, ZHENG Qi-Long, DENG Wen-Qi, YANG Jiang-Ping, LU Mao-Hui
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
Foundation item: National Science and Technology Major Program (2012ZX01034-001-001)
Abstract: The Convolutional Neural Network (CNN), one of the core deep learning algorithms, has been applied in many fields. Because network models are large and structurally complex and involve large amounts of data, their demand for computational resources must be reduced. A data-parallel strategy is commonly used to partition such data-intensive computing tasks. However, a data-parallel strategy that ignores the characteristics of the computing tasks results in a high volume of data transmission. It is therefore essential to design a reasonable data partitioning strategy, based on an analysis of the network structure and computational characteristics of CNNs, to reduce the amount of data transmitted. This paper first reviews the optimization of computing tasks in deep learning accelerators, then introduces the architecture of a deep learning accelerator based on the many-core BWDSP and designs a computing partition strategy, and finally compares and analyzes experimental results on VGGNet-16. The experimental results show that the proposed optimization algorithm significantly improves data transmission performance and reduces the amount of data transmitted.
Key words: many-core BWDSP; data parallel; Convolutional Neural Network (CNN); computing task partition

1 CNN Network Model and BWDSP Many-Core Architecture

1.1 CNN Network Model

A CNN is a kind of feedforward neural network comprising an input layer, hidden layers, and an output layer. The hidden layers consist mainly of three types of layers: convolutional layers, pooling layers, and fully connected layers. In more complex CNN models, the hidden layers may contain several stages of convolution and pooling. Convolutional layers extract features from the input data, pooling layers perform feature selection and information filtering, and the fully connected layers generally form the last part of the hidden layers, each passing its information on to the next fully connected layer. Figure 1 shows a relatively simple CNN model, LeNet5 [7], a convolutional neural network designed by LeCun Y for handwritten digit recognition; it has 2 convolutional layers, 2 pooling layers, and 2 fully connected layers.
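To make the layer arithmetic concrete, the following minimal Python sketch traces feature-map extents through the LeNet5 stages described above. The 32×32 input and the 5×5 convolution / 2×2 pooling windows are the standard LeNet5 hyperparameters and are assumed here rather than read from Figure 1.

def conv_out(size, f, s=1):
    # Valid convolution: floor((size - f) / s) + 1 output positions.
    return (size - f) // s + 1

def pool_out(size, f):
    # Non-overlapping pooling (stride equal to the window size).
    return size // f

h = 32               # input image extent (assumed 32x32)
h = conv_out(h, 5)   # conv1: 32 -> 28, 6 feature maps
h = pool_out(h, 2)   # pool1: 28 -> 14
h = conv_out(h, 5)   # conv2: 14 -> 10, 16 feature maps
h = pool_out(h, 2)   # pool2: 10 -> 5
print(h)             # 5; the 5x5x16 maps feed the fully connected layers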

Figure 1  LeNet5 network structure

1.2 The BWDSP Many-Core Architecture

The BWDSP series of processors is based on a clustered architecture, and its instruction set supports VLIW and SIMD operations. Each processor contains 4 clusters, and each cluster contains 4 multipliers that support MAC operations, giving a peak computing capability of up to 30 GOPS. This architecture and computing capability make it well suited to deep learning tasks involving large volumes of data and heavy computation. Figure 2 shows the BWDSP [8] architecture.

Figure 2  BWDSP architecture

The computing algorithm of the BWDSP many-core architecture assigns the computation of each individual output to a separate core, and all of the inputs related to that output are loaded into the computing core's local memory. When a core finishes its computation, it transmits the data in its local memory to the other cores; once all cores have finished transmitting, that is, once the input for the next layer is fully prepared, computation of the next layer begins.
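The layer-by-layer compute-then-exchange schedule described above can be sketched schematically as follows. Core and compute_outputs are simplified stand-ins for the accelerator's cores and compute kernels, not names from the BWDSP toolchain, and the sequential loops model the barrier: layer n+1 starts only after every core has computed and exchanged its layer-n outputs.

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.local_in = None   # inputs staged in this core's local memory
        self.local_out = None  # outputs this core is responsible for

def compute_outputs(layer, local_in):
    # Placeholder kernel: each core produces its assigned slice of outputs.
    return f"{layer}:out({local_in})"

def run_network(layers, cores, first_input):
    for c in cores:
        c.local_in = first_input
    for layer in layers:
        for c in cores:                          # per-core computation
            c.local_out = compute_outputs(layer, c.local_in)
        gathered = [c.local_out for c in cores]  # all-to-all exchange
        for c in cores:
            # Every core now holds the next layer's full input locally.
            c.local_in = gathered
    return cores[0].local_in

cores = [Core(i) for i in range(4)]
run_network(["conv1", "pool1"], cores, "img")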

Figure 3  BWDSP many-core architecture

2 Design of Computing Task Partitioning

Figure 4  Flow of the partitioning strategy

$\left\{ \begin{aligned} & {O_H} = \left\lceil \dfrac{I_H}{F_H} \right\rceil \\ & {O_W} = \left\lceil \dfrac{I_W}{F_W} \right\rceil \end{aligned} \right.$ (1)
$\left\{ \begin{aligned} & {O_H} = \left\lceil \dfrac{I_H - F_H + 1}{F_H} \right\rceil \\ & {O_W} = \left\lceil \dfrac{I_W - F_W + 1}{F_W} \right\rceil \end{aligned} \right.$ (2)
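The two output-size formulas translate directly into Python; math.ceil implements the ceiling brackets, and the window extent F in the denominator of Eq. (2) is kept exactly as printed.

import math

def output_size_pool(i, f):
    # Eq. (1): pooling output extent for input extent i and window f.
    return math.ceil(i / f)

def output_size_conv(i, f):
    # Eq. (2): output extent for input extent i and filter extent f,
    # with the window extent itself as the denominator, as printed.
    return math.ceil((i - f + 1) / f)

print(output_size_pool(224, 2))  # 112
print(output_size_conv(224, 3))  # 74 = ceil(222 / 3)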
2.1 Data-Parallel Computation

$\left\{ \begin{aligned} & {AVG}_{H} = \left\lceil \dfrac{I_H}{N} \right\rceil \\ & {AVG}_{W} = \left\lceil I_W \right\rceil \end{aligned} \right.$ (3)
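Equation (3) is the plain data-parallel split: the input's rows are divided evenly among the N cores and every core keeps all I_W columns. A minimal sketch:

import math

def naive_partition(i_h, i_w, n):
    # Eq. (3): per-core block under plain row-wise data parallelism.
    avg_h = math.ceil(i_h / n)  # each core gets ceil(I_H / N) rows
    avg_w = i_w                 # and the full width
    return avg_h, avg_w

print(naive_partition(224, 224, 16))  # (14, 224)

Because this split ignores the extra boundary rows that a convolution window needs across block edges, cores must exchange additional data at every layer; this is the transmission overhead that the convolution-aware partitioning of Section 2.2 targets.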

2.2 Parallel Partitioning Design for Convolution Tasks

Figure 5  Convolutional layer fusion

$\left\{ \begin{aligned} & {AVG}_{H} = \left\lceil \dfrac{O_H/N - 1}{S_H} + F_H \right\rceil \\ & {AVG}_{W} = \left\lceil \dfrac{O_W - 1}{S_W} + F_W \right\rceil \end{aligned} \right.$ (4)
$\left\{ \begin{aligned} & {AVG}_{H} = 4 \\ & {AVG}_{W} = \left\lceil \dfrac{\dfrac{O_H \times O_W}{2 \times N} - 1}{S_W} + F_W \right\rceil \end{aligned} \right.$ (5)
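Read literally, Eq. (4) sizes each core's input block from its share O_H/N of the output rows together with the stride and filter extents, while Eq. (5) instead fixes the block height at 4 rows and derives the width from half of the per-core share of output elements. A direct transcription, keeping both formulas exactly as printed:

import math

def partition_eq4(o_h, o_w, n, f_h, f_w, s_h, s_w):
    # Eq. (4): per-core input block derived from the core's share of
    # output rows; transcribed exactly as printed.
    avg_h = math.ceil((o_h / n - 1) / s_h + f_h)
    avg_w = math.ceil((o_w - 1) / s_w + f_w)
    return avg_h, avg_w

def partition_eq5(o_h, o_w, n, f_w, s_w):
    # Eq. (5): block height fixed at 4 rows; width derived from half of
    # the per-core share of output elements, as printed.
    avg_h = 4
    avg_w = math.ceil((o_h * o_w / (2 * n) - 1) / s_w + f_w)
    return avg_h, avg_w

print(partition_eq4(112, 112, 16, 3, 3, 1, 1))  # (9, 114)
print(partition_eq5(112, 112, 16, 3, 1))        # (4, 394)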

3 Experimental Analysis

3.1 Experimental Data

3.2 Experimental Results and Analysis

Figure 6  Comparison of data transmission volume

4 Conclusion and Future Work

References
[1] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444. DOI:10.1038/nature14539
[2] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252.
[3] Gu JX, Wang ZH, Kuen J, et al. Recent advances in convolutional neural networks. Pattern Recognition, 2018, 77: 354-377. DOI:10.1016/j.patcog.2017.10.013
[4] Le QV. Building high-level features using large scale unsupervised learning. Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, BC, Canada. 2013. 8595-8598.
[5] Chen TS, Du ZD, Sun NH, et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices, 2014, 49(4): 269-284.
[6] Parashar A, Rhu M, Mukkara A, et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 2017, 45(2): 27-40. DOI:10.1145/3140659
[7] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): 2278-2324. DOI:10.1109/5.726791
[8] CETC 38. BWDSP100 Software User Manual. Hefei: The 38th Research Institute of China Electronics Technology Group Corporation, 2011. 181-191.
[9] Deng WQ. Research on a many-core deep learning accelerator based on BWDSP [Master's thesis]. Hefei: University of Science and Technology of China, 2018.
[10] He KM, Zhang XY, Ren SQ, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 770-778.
[11] Abadi M, Barham P, Chen JM, et al. TensorFlow: A system for large-scale machine learning. Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. Savannah, GA, USA. 2016. 265-283.
[12] Jia YQ, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, FL, USA. 2014. 675-678.
[13] Hillis WD, Steele Jr GL. Data parallel algorithms. Communications of the ACM, 1986, 29(12): 1170-1183. DOI:10.1145/7902.7903
[14] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.