Optimized Global Symbol Relocations in SW26010Pro Processors
Funding:

National Key Research and Development Program of China (2020YFB0204602)
    Abstract:

    On Shenwei heterogeneous many-core processors, the computing cores access main memory with very high latency, so computing-core code should avoid main-memory accesses wherever possible. The global offset table (GOT) stores the addresses of the program's global variables and functions. It is not suited to the scarce local storage of the computing cores, and because its access pattern is usually scattered, it is not suited to cache prefetching either. The main-memory accesses introduced by GOT lookups therefore have a considerable impact on program performance. For the static-linking and dynamic-linking scenarios of heterogeneous many-core programs, this paper analyzes the usage limitations of linker relaxation optimization and implements a global symbol relocation optimization based on "gp base address + extended offset" that avoids main-memory accesses. Experimental results show that, at the cost of a small amount of additional code, the method effectively avoids the main-memory accesses introduced by GOT lookups when computing-core code calls functions and accesses global variables, improving the running performance of many-core programs.
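
    The abstract does not give the instruction sequence behind "gp base address + extended offset", so the following is only a minimal sketch of the general idea: a symbol's address is computed from the gp register plus an offset that is too large for one immediate, which avoids loading the address from the global offset table. The 16-bit high/low split, the function names `split_gp_offset` and `gp_relative_address`, and the example values are assumptions for illustration, not the encoding used in the paper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the offset is materialized by two instructions, each carrying
 * a signed 16-bit immediate (a high part scaled by 65536 and a low part).
 * This mirrors a common RISC pattern and is not taken from the paper. */

/* Split a gp-relative offset into signed 16-bit high and low parts so that
 * off == hi * 65536 + lo. Returns false when the offset is out of range. */
static bool split_gp_offset(int64_t off, int32_t *hi, int32_t *lo)
{
    /* Sign-extend the low 16 bits of the offset. */
    int64_t l = ((off & 0xFFFF) ^ 0x8000) - 0x8000;
    /* The remainder is an exact multiple of 65536. */
    int64_t h = (off - l) / 65536;
    if (h < -32768 || h > 32767)
        return false;           /* does not fit two 16-bit immediates */
    *hi = (int32_t)h;
    *lo = (int32_t)l;
    return true;
}

/* Recompute the symbol address the way the two instructions would:
 * no memory load is involved, only register arithmetic on gp. */
static uint64_t gp_relative_address(uint64_t gp, int32_t hi, int32_t lo)
{
    return gp + (uint64_t)((int64_t)hi * 65536 + (int64_t)lo);
}
```

    In this sketch, a linker would call `split_gp_offset(symbol_address - gp, &hi, &lo)` at link time and patch the two immediates. When the function returns false, the symbol lies outside the reachable window and a fallback such as the ordinary GOT load would still be needed.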

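Cite this article

Qian Hong, Wang Fei, Liu Sha, Zheng Tianyu, Song Jiawei, An Hong. Optimized Global Symbol Relocations in SW26010Pro Processors. Computer Systems & Applications, 2024, 33(2): 62-71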

History
  • Received: 2023-07-16
  • Last revised: 2023-08-21
  • Published online: 2023-11-24
  • Publication date: 2023-02-05