Oversampling Method for Software Defect Prediction
    Abstract:

    To alleviate the class imbalance problem in software defect prediction and to keep overfitting from degrading the accuracy of the prediction model, this paper proposes HDR, an oversampling method for software defect prediction based on heterogeneous distance ranking. First, minority-class instances are divided into three categories and noise instances are removed, which reduces the overfitting caused by noisy data. Next, the instances are ranked by their heterogeneous distance, and highly similar instances are combined in pairs to generate new instances, which improves the diversity of the new instances. Finally, valuable minority-class instances that were deleted are restored. The experiments compare HDR with SMOTE and Borderline-SMOTE using an RF classifier on eight real project datasets from NASA. HDR improves F1-measure and G-Mean by 7.7% and 10.6%, respectively, and clearly outperforms the other two algorithms on software defect prediction datasets that are large and highly imbalanced.
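The abstract names the steps of HDR but not their details, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the three-way split follows the Borderline-SMOTE convention (safe, borderline, noise, decided by the share of majority-class points among the k nearest neighbours), that the "heterogeneous distance" of a minority instance is its Euclidean distance to the nearest majority-class instance, and that "highly similar" instances are those adjacent in the resulting ranking. All function names are hypothetical. The restoration of valuable deleted instances is omitted because the abstract gives no criterion for it.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def categorize_minority(minority, majority, k=3):
    """Label each minority instance 'safe', 'borderline', or 'noise'.

    Assumed rule (borrowed from Borderline-SMOTE, not taken from the paper):
    count the majority-class points among the k nearest neighbours.
    """
    labels = []
    for i, p in enumerate(minority):
        # Flag 0 marks a minority neighbour, flag 1 a majority neighbour.
        neighbours = [(euclidean(p, q), 0) for j, q in enumerate(minority) if j != i]
        neighbours += [(euclidean(p, q), 1) for q in majority]
        neighbours.sort(key=lambda item: item[0])
        n_major = sum(flag for _, flag in neighbours[:k])
        if n_major == k:
            labels.append("noise")
        elif 2 * n_major >= k:
            labels.append("borderline")
        else:
            labels.append("safe")
    return labels


def heterogeneous_distance(p, majority):
    """Assumed definition: distance from p to its nearest majority-class instance."""
    return min(euclidean(p, q) for q in majority)


def oversample(minority, majority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority instances by interpolating ranked pairs."""
    rng = random.Random(seed)
    labels = categorize_minority(minority, majority, k)
    # Drop noise instances before generating anything.
    kept = [p for p, label in zip(minority, labels) if label != "noise"]
    if len(kept) < 2:
        return []
    # Rank the remaining instances by their distance to the other class.
    ranked = sorted(kept, key=lambda p: heterogeneous_distance(p, majority))
    synthetic = []
    for i in range(n_new):
        # Pair instances that are adjacent in the ranking.
        j = i % (len(ranked) - 1)
        a, b = ranked[j], ranked[j + 1]
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


if __name__ == "__main__":
    minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (5.0, 5.0)]
    majority = [(5.1, 5.0), (4.9, 5.2), (5.0, 4.8), (6.0, 6.0), (4.0, 4.5)]
    print(categorize_minority(minority, majority))
    print(oversample(minority, majority, n_new=4))
```

In this toy data the minority point sitting inside the majority cluster is treated as noise and contributes nothing to the synthetic instances. Pairing by rank rather than by random nearest neighbour is one plausible reading of the abstract; other pairing rules would fit the description equally well.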

Cite this article

纪兴哲, 邵培南. Oversampling Method for Software Defect Prediction. 计算机系统应用 (Computer Systems & Applications), 2022, 31(1): 242-248.

History
  • Received: 2021-04-01
  • Last revised: 2021-04-29
  • Published online: 2021-12-17