Received: January 14, 2024; Revised: February 07, 2024
Abstract: Cross-project defect prediction (CPDP) has become an important research direction in software engineering and data mining. It builds prediction models from the defect data of other, data-rich projects, which addresses the shortage of training data during model construction. However, the code files of the source and target projects follow different distributions, and this degrades cross-project prediction performance. Most studies address the problem with domain adaptation, but existing methods have two shortcomings. First, they consider the influence of only the conditional or the marginal distribution and ignore the dynamic nature of these distributions. Second, they do not select appropriate pseudo-labels. To address both shortcomings, this study proposes a cross-project defect prediction method based on dynamic distribution alignment and pseudo-label learning (DPLD). Specifically, the method uses adversarial domain adaptation to reduce the marginal distribution difference between projects in a domain alignment module and the conditional distribution difference in a category alignment module, and it uses a dynamic distribution factor to describe, dynamically and quantitatively, the relative importance of the two distributions. In addition, this study proposes a pseudo-label learning method that uses the geometric similarity between data to make the pseudo-labels more accurate substitutes for real labels. Experiments on the PROMISE dataset show that DPLD achieves average improvements of 22.98% in F-measure and 15.21% in AUC, which demonstrates the effectiveness of the method in reducing distribution differences between projects and improving cross-project defect prediction performance.
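The role of the dynamic distribution factor can be illustrated with a minimal sketch. The abstract does not give the formula, so everything below is an assumption made for illustration: the function names `dynamic_factor` and `alignment_loss` are hypothetical, and the rule of weighting each alignment term in proportion to its estimated discrepancy is one plausible choice, not necessarily the one DPLD uses.

```python
from typing import Sequence


def dynamic_factor(marginal_dist: float, conditional_dists: Sequence[float]) -> float:
    """Return a weight in [0, 1] for the marginal alignment term.

    Assumption (not from the paper): the weight is the share of the total
    estimated discrepancy that comes from the marginal distribution, with
    the per-class conditional discrepancies averaged.
    """
    mean_cond = sum(conditional_dists) / len(conditional_dists)
    total = marginal_dist + mean_cond
    if total == 0:
        # No measurable discrepancy: treat both terms as equally important.
        return 0.5
    return marginal_dist / total


def alignment_loss(marginal_loss: float,
                   conditional_losses: Sequence[float],
                   omega: float) -> float:
    """Combine the domain (marginal) and category (conditional) alignment losses.

    omega weights the marginal term and (1 - omega) weights the mean of the
    per-class conditional terms.
    """
    mean_cond = sum(conditional_losses) / len(conditional_losses)
    return omega * marginal_loss + (1.0 - omega) * mean_cond


# Example: two classes (defective, non-defective), hypothetical discrepancies.
omega = dynamic_factor(0.6, [0.2, 0.2])
loss = alignment_loss(1.0, [0.5, 0.5], omega)
```

In a training loop, such a factor would typically be re-estimated periodically (for example, once per epoch) so that the balance between the two alignment terms follows the current state of the model rather than being fixed in advance.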
Keywords: domain adaptation; cross-project defect prediction; conditional distribution; marginal distribution; pseudo-label learning
Funding: National Natural Science Foundation of China (62172249); Fundamental Research Funds for the Central Universities (93K172022K01)
Citation:
GAO Qin-Qin (高芹芹), LING Song-Song (凌松松), YU Jie (于婕), YU Xu (于旭). Cross-project Defect Prediction Based on Dynamic Distribution Alignment and Pseudo-label Learning. Computer Systems Applications (计算机系统应用), 2024, 33(8): 40-50