Received: February 24, 2022; Revised: March 28, 2022
Abstract: Traditional token-based clone detection methods exploit the sequential nature of code strings to detect clones quickly in large code repositories. However, compared with methods based on the abstract syntax tree (AST) or the program dependence graph (PDG), they lack syntactic and semantic information and therefore struggle to detect clones whose text differs substantially. To address this, this study proposes a token-based clone detection method that enriches tokens with semantic information. First, the AST is analyzed, and the AST path is used to abstract the semantic information of each token located at a leaf node. Then, a low-cost index is built on the tokens that play the role of function names and type names, so that candidate clone fragments can be filtered quickly and effectively. Finally, the similarity between code blocks is determined using the semantically enriched tokens. Experimental results on the public large-scale dataset BigCloneBench show that the method significantly outperforms mainstream methods, including NiCad, Deckard, and CCAligner, on Moderately Type-3 and Weakly Type-3/Type-4 clones, which have low textual similarity, while requiring less detection time on large code repositories.
Keywords: code clone detection; abstract syntax tree; semantic information; efficient index; source code
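Funding: National Natural Science Foundation of China (61672480); 111 Talent Introduction Program of the State Administration of Foreign Experts Affairs (BP0719016)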
Citation:
WANG Wen-Jie, XU Yun. Code Clone Detection Based on Token Semantics. Computer Systems Applications, 2022, 31(11): 60-67