Abstract: BLAS is one of the most important underlying math libraries for scientific computing, and its Level 3 functions are the most widely used. In this paper, we present a high-performance method for implementing Level 3 BLAS functions on the domestic Sunway 1600 platform, taking GEMM as an example. For the single-core implementation, we apply a number of platform-specific tuning techniques to improve performance, including multiply-add instructions, loop unrolling, software pipelining and instruction rearrangement, SIMD operations, and register blocking. For the multi-core implementation, we propose an efficient multi-threaded method. Experiments show that our single-threaded implementation achieves a speedup of 4.72 over GotoBLAS, a well-known open-source BLAS library. In addition, 4-threaded execution achieves an average speedup of 3.02 over single-threaded execution.
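To make the idea of register blocking concrete, the following is a minimal sketch in portable C. It is not the kernel described in this paper: the function names `gemm_naive` and `gemm_blocked_2x2`, the 2x2 tile size, the row-major square layout, and the requirement that n be even are all assumptions chosen for illustration, and no Sunway-specific multiply-add or SIMD instructions are used.

```c
#include <stddef.h>

/* Reference triple loop: C += A * B, with all matrices n x n, row-major. */
void gemm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
    }
}

/*
 * Register-blocked variant (illustrative only): a 2x2 tile of C is
 * accumulated in local variables, which the compiler can keep in
 * registers, so each loaded element of A and B is reused twice.
 * Assumption: n is a multiple of 2 (no edge handling).
 */
void gemm_blocked_2x2(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t k = 0; k < n; k++) {
                double a0 = A[i * n + k];
                double a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j];
                double b1 = B[k * n + j + 1];
                /* Four multiply-adds per two loads of A and two of B. */
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j]           += c00;
            C[i * n + j + 1]       += c01;
            C[(i + 1) * n + j]     += c10;
            C[(i + 1) * n + j + 1] += c11;
        }
    }
}
```

Both functions compute the same update; the blocked version simply trades four loads for four multiply-adds in the inner loop instead of two loads per multiply-add. A tuned kernel would typically use a larger tile matched to the register file and combine it with the other techniques listed above.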