Received: April 29, 2010; Revised: May 27, 2010
Abstract: Based on the characteristics of the Loongson 3A architecture and of the Level 2 BLAS routines, this article extracts parallelism at the instruction, memory, and thread levels, summarizes suitable optimization methods, and analyzes them quantitatively. Experiments show that these optimizations improve the single-threaded performance of Level 2 BLAS functions by more than 20% and yield a multi-threaded speedup of about 2.5. These results should be of some help to future system software optimization work on multi-core Loongson processors.
Keywords: Loongson 3A; BLAS; optimization; Gemv; Ger; memory access; multi-threading
Funding: National High Technology Research and Development Program of China (863 Program) (2008AA010902); Natural Science Foundation (60833004)
Cite this article:
LI Yi, HE Song-Song, LI Kai. Optimization of BLAS Level 2 Based on Multi-Core Loongson 3A. Computer Systems Applications (计算机系统应用), 2011, 20(1): 163-167