Source Code Migration Model Based on Code-statement Masked Attention Mechanism

Authors: XU Ming-Rui, LI Zheng, LIU Yong, WU Yong-Hao

Funding: National Natural Science Foundation of China (61902015, 61872026)
    Abstract:

    Source code migration techniques aim to convert source code from one programming language to another, reducing the burden on developers when migrating software projects. Existing studies typically apply neural machine translation (NMT) models to convert source code into target code, but they ignore code structure features, which results in poor migration performance. To address this, this paper proposes CSMAT (code-statement masked attention Transformer), a source code migration model based on a code-statement masked attention mechanism. Using the Transformer's masked attention mechanism, CSMAT is guided to understand the syntax and semantics of source code statements, as well as the contextual features between statements, during encoding, and to attend to and align with source code statements during decoding, thereby improving source code migration performance. An empirical study is conducted on CodeTrans, a dataset built from real projects, and model performance is evaluated with four metrics. The experimental results validate the effectiveness of CSMAT and the applicability of the code-statement masked attention mechanism to pre-trained models.
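    To make the mechanism concrete, the following PyTorch sketch shows one way a statement-level attention mask can be built and applied in scaled dot-product attention. The statement-index annotation, function names, and masking policy here are illustrative assumptions for exposition; the abstract does not specify the exact CSMAT implementation.

        # Minimal sketch of statement-aware masked attention (not the authors' code).
        # Assumes each token is annotated with the index of the statement it belongs to.
        import torch
        import torch.nn.functional as F

        def statement_mask(stmt_ids: torch.Tensor) -> torch.Tensor:
            # (i, j) is True iff tokens i and j belong to the same code statement.
            # stmt_ids: (seq_len,) tensor of per-token statement indices.
            return stmt_ids.unsqueeze(0) == stmt_ids.unsqueeze(1)

        def masked_attention(q, k, v, mask):
            # Scaled dot-product attention; disallowed positions get -inf scores,
            # so softmax assigns them zero weight.
            d_k = q.size(-1)
            scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
            scores = scores.masked_fill(~mask, float("-inf"))
            return F.softmax(scores, dim=-1) @ v

        # Toy example: 6 tokens forming 2 statements, e.g. "int x = 1;" / "return x;"
        stmt_ids = torch.tensor([0, 0, 0, 1, 1, 1])
        mask = statement_mask(stmt_ids)          # (6, 6) boolean mask
        x = torch.randn(6, 16)                   # token embeddings, d_model = 16
        out = masked_attention(x, x, x, mask)    # intra-statement self-attention
        print(out.shape)                         # torch.Size([6, 16])

    Under this reading, some encoder layers could confine self-attention within statements to capture intra-statement syntax and semantics while others attend across statements for inter-statement context, and an analogous mask over decoder cross-attention could encourage alignment with individual source statements. Again, this is a plausible instantiation rather than the paper's exact design.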

Cite this article:

XU Ming-Rui, LI Zheng, LIU Yong, WU Yong-Hao. Source code migration model based on code-statement masked attention mechanism. Computer Systems & Applications, 2023, 32(9): 77–88.

History
  • Received: 2023-02-14
  • Revised: 2023-03-14
  • Published online: 2023-06-09