Received: March 13, 2024 Revised: April 10, 2024
Abstract: When an imbalanced dataset contains noise and class overlap, traditional classifiers perform poorly and minority-class samples are hard to classify accurately. To improve classification performance, this paper proposes a method for handling imbalanced data based on shared nearest neighbor density peak clustering and an ensemble filtering mechanism. The method first uses the shared nearest neighbor density peak clustering algorithm to adaptively partition the minority-class samples into multiple clusters, then assigns an oversampling weight to each sub-cluster according to its density and size. When synthesizing samples within a sub-cluster, it uses each sample's local sparsity and majority-class aggregation degree to select neighboring samples and to determine the weight range for linear interpolation, which keeps new samples from being generated in regions where the majority class is concentrated. Finally, an ensemble filtering mechanism removes noise and hard-to-learn boundary samples, which regularizes the decision boundary and improves the quality of the generated samples. Compared with 5 oversampling methods, the proposed algorithm performs better overall on 8 public datasets.
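The abstract gives no formulas, so the following is only a minimal sketch of two of the steps it names: splitting the oversampling budget across sub-clusters and generating a sample by linear interpolation within a restricted weight range. The helper names `allocate_counts` and `interpolate`, and the weighting rule (weight proportional to 1 / (size × density), so that smaller and sparser sub-clusters receive more samples), are assumptions made for illustration and may differ from the paper's definitions.

```python
import random


def allocate_counts(sizes, densities, n_new):
    """Split n_new synthetic samples across sub-clusters.

    Assumption (not from the paper): a sub-cluster's weight is
    proportional to 1 / (size * density).
    """
    raw = [1.0 / (s * d) for s, d in zip(sizes, densities)]
    total = sum(raw)
    shares = [n_new * r / total for r in raw]
    counts = [int(share) for share in shares]
    # Give the leftover samples to the largest fractional parts.
    by_fraction = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_fraction[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def interpolate(x, neighbor, lam_low, lam_high, rng=random):
    """Return x + lam * (neighbor - x), lam uniform in [lam_low, lam_high].

    Narrowing [lam_low, lam_high] keeps the new point closer to x; how the
    range is derived from sparsity and aggregation is left to the paper.
    """
    lam = rng.uniform(lam_low, lam_high)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]
```

For instance, `allocate_counts([10, 5], [1.0, 1.0], 9)` gives the smaller sub-cluster twice the share of the larger one under this assumed rule, and `interpolate(x, neighbor, 0.0, 0.5)` always returns a point in the half of the segment nearest to `x`.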
Funding: National Natural Science Foundation of China (11801436)
Cite this article:
李红玲,王彪.基于共享近邻密度峰值聚类的过采样方法.计算机系统应用,2024,33(10):245-254
LI Hong-Ling, WANG Biao. Oversampling Method Based on Shared Nearest Neighbors for Density Peak Clustering. COMPUTER SYSTEMS APPLICATIONS, 2024, 33(10): 245-254