Received: February 21, 2017 Revised: March 09, 2017
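Chinese abstract (translated): BLAS (Basic Linear Algebra Subprograms) is a library of basic functions that operate on vectors and matrices. Its functions fall into three levels, which provide basic vector-vector (level 1), vector-matrix (level 2), and matrix-matrix (level 3) operations. This paper studies how to implement the BLAS level 1 and level 2 functions in parallel on the Shenwei many-core processor, exploits the platform's characteristics to tune their performance in depth, and summarizes techniques for implementing and optimizing parallel programs on the Shenwei platform. The Shenwei 26010 CPU adopts a heterogeneous many-core architecture; the large-scale parallel processing capability provided by its many computing cores gives a single chip a double-precision floating-point performance of 3 TFLOPS. Experimental results show that the BLAS level 1 and level 2 functions reach average speedups of 11.x and 6.x, respectively, over the GotoBLAS reference implementation, and that every optimization technique delivers a clear performance gain.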
Abstract: BLAS (Basic Linear Algebra Subprograms) is a specification that prescribes a set of low-level routines for common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. The routines are divided into three levels, which provide basic vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3) operations, respectively. In this paper, we study the parallel implementation of BLAS level 1 and level 2 functions on the Shenwei many-core processor, exploit the characteristics of the platform to tune their performance in depth, and summarize techniques for parallelizing and optimizing programs on the Shenwei platform. The Shenwei 26010 CPU uses a heterogeneous many-core architecture. Its many computing cores provide large-scale parallel processing capability, so that the double-precision floating-point performance of a single chip can reach 3 TFLOPS. Experimental results show that the BLAS level 1 and level 2 functions achieve average speedups of as much as 11.x and 6.x, respectively, over the GotoBLAS reference implementation, and that each optimization technique yields a clear performance gain.
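To make the level 1 and level 2 distinction concrete, the following is a minimal, unoptimized sketch in C of one routine from each level: AXPY (vector-vector) and GEMV (matrix-vector). The names `ref_daxpy` and `ref_dgemv` and their simplified signatures are assumptions made for illustration. They are not the standard BLAS interfaces, and they do not represent the parallel Shenwei implementation studied in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha * x + y, the AXPY operation.
 * Hypothetical simplified signature: unit stride only. */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha * A * x + beta * y, the GEMV operation.
 * A is m-by-n, stored column-major with leading dimension lda >= m.
 * Hypothetical simplified signature: no transpose option, unit stride only. */
void ref_dgemv(size_t m, size_t n, double alpha, const double *a, size_t lda,
               const double *x, double beta, double *y)
{
    assert(lda >= m);

    /* Scale the output vector first: y <- beta * y. */
    for (size_t i = 0; i < m; ++i)
        y[i] *= beta;

    /* Accumulate one column of A at a time so memory access is contiguous. */
    for (size_t j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (size_t i = 0; i < m; ++i)
            y[i] += t * a[i + j * lda];
    }
}
```

Both routines perform only a constant number of floating-point operations per element loaded from memory, which is why level 1 and level 2 functions are generally limited by memory bandwidth rather than by peak compute throughput.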
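Funding: Major Research Plan Integrated Project of the National Natural Science Foundation of China (91530323); National High Technology Research and Development Program of China (863 Program) (2015AA01A302)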
Citation:
孙家栋,孙乔,邓攀,杨超.基于申威众核处理器的1、2级BLAS函数优化研究.计算机系统应用,2017,26(11):101-108
SUN Jia-Dong, SUN Qiao, DENG Pan, YANG Chao. Research on the Optimization of BLAS Level 1 and 2 Functions on Shenwei Many-Core Processor. COMPUTER SYSTEMS APPLICATIONS, 2017, 26(11): 101-108