Abstract: The Basic Linear Algebra Subprograms (BLAS) are a standard interface for basic linear algebra operations. The library is divided into three levels, which provide vector-vector operations (Level 1), matrix-vector operations (Level 2), and matrix-matrix operations (Level 3). In this paper, we study the optimization of BLAS Level 1 functions on the SW1621 processor. Taking the AXPY function as an example, we exploit the architectural characteristics of the platform to optimize its performance and design an automatic thread allocation scheme. Experimental results show that, compared with the reference implementation in GotoBLAS, the optimized Level 1 function AXPY achieves speedups of up to 4.36 on a single core and 9.50 on multiple cores. Each optimization contributes to the performance improvement.
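For readers unfamiliar with AXPY, the operation updates a vector as y ← αx + y. The following is a minimal sketch in portable C, assuming double precision and unit stride. The function names `axpy_ref` and `axpy_unrolled` are hypothetical, and the 4-way loop unrolling is only a generic illustration of a single-core optimization; it is not the SW1621-specific scheme studied in this paper.

```c
#include <stddef.h>

/* Reference AXPY: y[i] = alpha * x[i] + y[i], unit stride assumed. */
void axpy_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

/* Illustrative variant with 4-way manual unrolling (hypothetical name).
 * The main loop handles blocks of four elements; the tail loop handles
 * the remaining n % 4 elements. */
void axpy_unrolled(size_t n, double alpha, const double *x, double *y)
{
    size_t i = 0;
    size_t limit = n - (n % 4);

    for (; i < limit; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    for (; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

Both functions compute the same result. Unrolling of this kind typically gives the compiler more independent operations to schedule per iteration, but any gain depends on the target architecture and compiler.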