Code Clone Detection Based on Token Semantics

doi:10.15888/j.cnki.csa.008783


Volume 31, Issue 11, 2022, pp. 60-67

Author:
WANG Wen-Jie, XU Yun
Affiliation:
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; Key Laboratory of High Performance Computing of Anhui Province, University of Science and Technology of China, Hefei 230027, China


Abstract:

Traditional token-based clone detection methods exploit the sequential nature of code strings to detect clones quickly in large code repositories. However, because they lack syntactic and semantic information, they can hardly detect code clones with large textual differences, unlike methods based on the abstract syntax tree (AST) or the program dependency graph (PDG). This study therefore proposes a token-based clone detection method that incorporates semantic information. First, the AST is analyzed, and the semantic information of the tokens located at its leaf nodes is abstracted using AST paths. Then, a low-cost index is built on the tokens that play the role of function names and types, so that valid candidate clone fragments can be filtered quickly. Finally, the similarity between code blocks is judged using the tokens with semantic information. Experimental results on the public large-scale dataset BigCloneBench show that the method significantly outperforms mainstream methods, including NiCad, Deckard, and CCAligner, on Moderately Type-3 and Weakly Type-3/Type-4 clones with low text similarity, while requiring less detection time on large code repositories.

Key words: code clone detection; abstract syntax tree; semantic information; efficient index; source code
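
The three steps named in the abstract (semantic tokens from AST paths, an index for candidate filtering, and token-based similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no details, so everything below is an assumption made for illustration. It uses Python's standard `ast` module on Python fragments rather than the Java code in BigCloneBench, approximates an AST path by the parent node type of each leaf, indexes only called function names, and scores similarity as multiset overlap. The names `semantic_tokens`, `index_keys`, `candidate_pairs`, `similarity`, `detect_clones`, and the 0.7 threshold are all hypothetical.

```python
import ast
from collections import Counter, defaultdict


def semantic_tokens(source):
    """Step 1 (assumed form): label each leaf token with its parent node type.

    Identifier names are dropped, so consistently renamed variables still match.
    """
    tokens = Counter()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            if isinstance(child, ast.Name):
                tokens[(type(parent).__name__, "Name")] += 1
            elif isinstance(child, ast.Constant):
                tokens[(type(parent).__name__, type(child.value).__name__)] += 1
    return tokens


def index_keys(source):
    """Step 2 (assumed form): names of called functions serve as index keys."""
    keys = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                keys.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                keys.add(node.func.attr)
    return keys


def candidate_pairs(fragments):
    """Build an inverted index and return pairs sharing at least one key."""
    index = defaultdict(set)
    for frag_id, source in fragments.items():
        for key in index_keys(source):
            index[key].add(frag_id)
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


def similarity(a, b):
    """Step 3 (assumed metric): shared tokens over the larger fragment size."""
    shared = sum((a & b).values())
    return shared / max(sum(a.values()), sum(b.values()), 1)


def detect_clones(fragments, threshold=0.7):
    """Score only the indexed candidate pairs and keep those above threshold."""
    tokens = {fid: semantic_tokens(src) for fid, src in fragments.items()}
    return sorted(
        pair for pair in candidate_pairs(fragments)
        if similarity(tokens[pair[0]], tokens[pair[1]]) >= threshold
    )


# Hypothetical fragments: "a" and "b" differ only in identifier names.
fragments = {
    "a": "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return len(xs), s\n",
    "b": "def acc(items):\n    r = 0\n    for it in items:\n        r = r + it\n    return len(items), r\n",
    "c": "def greet(name):\n    print('hello ' + name)\n",
}
clones = detect_clones(fragments)
print(clones)
```

In this sketch the index keeps fragment "c" from ever being compared with "a" or "b", because it shares no called function name with them; only the pair that survives filtering is scored. A real detector would need a richer path abstraction and a tuned threshold.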

Get Citation

WANG Wen-Jie, XU Yun. Code Clone Detection Based on Token Semantics. Computer Systems & Applications, 2022, 31(11): 60-67

History

Received: February 24, 2022
Revised: March 28, 2022
Online: July 14, 2022
