Source Code Migration Model Based on Code-statement Masked Attention Mechanism

Authors: XU Ming-Rui, LI Zheng, LIU Yong, WU Yong-Hao

Funding: National Natural Science Foundation of China (61902015, 61872026)
    Abstract:

    Source code migration techniques aim to convert source code from one programming language to another, reducing the burden on developers when migrating software projects. Existing studies typically apply neural machine translation (NMT) models to convert source code into target code, but they ignore code structure features, which results in poor migration performance. To address this, this paper proposes CSMAT (code-statement masked attention Transformer), a source code migration model based on a code-statement masked attention mechanism. Using the Transformer's masked attention mechanism, CSMAT is guided to understand the syntax and semantics of source code statements, as well as the contextual features between statements, during encoding, and to attend to and align with source code statements during decoding, thereby improving source code migration performance. An empirical study is conducted on CodeTrans, a dataset built from real projects, and model performance is evaluated with four metrics. The experimental results validate the effectiveness of CSMAT and the applicability of the code-statement masked attention mechanism to pre-trained models.
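    To make the mechanism concrete, the following PyTorch sketch shows one way a statement-level attention mask can be built and applied in scaled dot-product attention. The statement-index annotation, function names, and masking policy here are illustrative assumptions for exposition; the abstract does not specify the exact CSMAT implementation.

        # Minimal sketch of statement-aware masked attention (not the authors' code).
        # Assumes each token is annotated with the index of the statement it belongs to.
        import torch
        import torch.nn.functional as F

        def statement_mask(stmt_ids: torch.Tensor) -> torch.Tensor:
            # (i, j) is True iff tokens i and j belong to the same code statement.
            # stmt_ids: (seq_len,) tensor of per-token statement indices.
            return stmt_ids.unsqueeze(0) == stmt_ids.unsqueeze(1)

        def masked_attention(q, k, v, mask):
            # Scaled dot-product attention; disallowed positions get -inf scores,
            # so softmax assigns them zero weight.
            d_k = q.size(-1)
            scores = (q @ k.transpose(-2, -1)) / d_k ** 0.5
            scores = scores.masked_fill(~mask, float("-inf"))
            return F.softmax(scores, dim=-1) @ v

        # Toy example: 6 tokens forming 2 statements, e.g. "int x = 1;" / "return x;"
        stmt_ids = torch.tensor([0, 0, 0, 1, 1, 1])
        mask = statement_mask(stmt_ids)          # (6, 6) boolean mask
        x = torch.randn(6, 16)                   # token embeddings, d_model = 16
        out = masked_attention(x, x, x, mask)    # intra-statement self-attention
        print(out.shape)                         # torch.Size([6, 16])

    Under this reading, some encoder layers could confine self-attention within statements to capture intra-statement syntax and semantics while others attend across statements for inter-statement context, and an analogous mask over decoder cross-attention could encourage alignment with individual source statements. Again, this is a plausible instantiation rather than the paper's exact design.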

Cite this article:

XU Ming-Rui, LI Zheng, LIU Yong, WU Yong-Hao. Source code migration model based on code-statement masked attention mechanism. Computer Systems & Applications, 2023, 32(9): 77–88.

History
  • Received: 2023-02-14
  • Revised: 2023-03-14
  • Published online: 2023-06-09