Optimized Global Symbol Relocations in SW26010Pro Processors

doi:10.15888/j.cnki.csa.009393

Computer Systems & Applications, 2024, Vol. 33, No. 2, pp. 62-71

CSTR: 32024.14.csa.009393
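Authors:
QIAN Hong, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
WANG Fei, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
LIU Sha, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
ZHENG Tian-Yu, Zhejiang Lab, Hangzhou 311121, China
SONG Jia-Wei, National Supercomputing Center in Wuxi, Wuxi 214000, China
AN Hong, School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China

Funding: National Key Research and Development Program of China (2020YFB0204602)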

Abstract:

On Shenwei heterogeneous many-core processors, accesses from the computing cores to main memory have very high latency, so computing-core code should avoid main-memory accesses wherever possible. The global offset table (GOT) stores the addresses of the program's global variables and functions. It is ill-suited to the scarce local storage of the computing cores, and because its access pattern is usually scattered, it is also a poor candidate for cache prefetching. The main-memory accesses introduced by GOT lookups therefore have a considerable impact on program performance. Targeting the static-linking and dynamic-linking usage scenarios of heterogeneous many-core programs, this paper analyzes the limitations of linker relaxation optimization and implements a global symbol relocation optimization, based on a "gp base address + extended offset" scheme, that avoids main-memory accesses. Experimental results show that, at the cost of a small increase in code size, the optimization effectively avoids the main-memory accesses introduced by GOT lookups when computing-core code calls functions and accesses global variables, improving the running performance of many-core programs.

Key words: many-core processor; global offset table (GOT); relocation; linker optimization; performance
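
The idea of addressing a symbol as "gp base address + extended offset" instead of loading its address from the GOT can be illustrated with a small model. The abstract does not give the instruction sequence or field widths, so the sketch below assumes a 16-bit signed displacement field and a two-step high/low split of a 32-bit offset, as used on Alpha-style instruction sets. The function names `split_gp_offset`, `resolve_via_gp`, and `resolve_via_got`, and the sample addresses, are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

# Hypothetical sketch: split a gp-relative displacement into a high and a low
# 16-bit signed part, so that  gp + (hi << 16) + lo == symbol address.
# The 16-bit field width and the two-step sequence are assumptions, not
# details taken from the paper.


def split_gp_offset(offset: int) -> tuple[int, int]:
    """Return (hi, lo) with offset == (hi << 16) + lo and lo in [-32768, 32767]."""
    lo = ((offset & 0xFFFF) ^ 0x8000) - 0x8000  # sign-extend the low 16 bits
    hi = (offset - lo) >> 16                    # what remains is a multiple of 65536
    if not -0x8000 <= hi <= 0x7FFF:
        raise ValueError("symbol is out of range of the extended gp offset")
    return hi, lo


def resolve_via_gp(gp: int, hi: int, lo: int) -> int:
    """Recompute the symbol address with register arithmetic only (no GOT load)."""
    return gp + (hi << 16) + lo


def resolve_via_got(got: dict[str, int], name: str) -> int:
    """Baseline: fetch the address from a GOT, modeled here as a dict lookup."""
    return got[name]


if __name__ == "__main__":
    gp = 0x1000_8000                  # made-up gp value
    got = {"counter": 0x1002_9ABC}    # made-up symbol address
    hi, lo = split_gp_offset(got["counter"] - gp)
    # Both paths yield the same address; only the GOT path models a memory load.
    print(hex(resolve_via_gp(gp, hi, lo)), hex(resolve_via_got(got, "counter")))
```

In this model the linker would fill in `hi` and `lo` at link time, which is only possible when the symbol's distance from gp is known and fits the extended range; otherwise the GOT load has to stay. The extra add for the high part corresponds to the small code-size increase mentioned in the abstract.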

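Cite this article

QIAN Hong, WANG Fei, LIU Sha, ZHENG Tian-Yu, SONG Jia-Wei, AN Hong. Optimized Global Symbol Relocations in SW26010Pro Processors. Computer Systems & Applications, 2024, 33(2): 62-71
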
History

Received: 2023-07-16
Revised: 2023-08-21
Published online: 2023-11-24
Published: 2023-02-05
