Anomaly Detection Based on Second-order Proximity

Author: 卢梦茹, 周昌军, 刘华文, 徐晓丹

Funding: National Natural Science Foundation of China (61976195)
Abstract:

The analysis of numerous and intricate data sets is a highly challenging task, in which techniques for detecting outliers in data play a pivotal role. Capturing anomalies through clustering is one of the most common approaches among the increasingly popular anomaly detection techniques. This study proposes an anomaly detection algorithm based on second-order proximity (SOPD), which consists of a clustering stage and an anomaly detection stage. During clustering, the similarity matrix is obtained through second-order proximity. During anomaly detection, the distance between each point in a cluster and that cluster's center is computed to capture anomalous states, and the density of each data point is taken into account to rule out points that merely lie on cluster boundaries. Second-order proximity allows the locality and the globality of the data to be considered simultaneously, which reduces the number of obtained clusters and increases the accuracy of anomaly detection. Extensive experiments comparing the algorithm with several classical anomaly detection algorithms show that SOPD performs well overall.
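To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of the two stages it outlines: a second-order similarity matrix built from first-order similarities, a clustering step over that matrix, and an anomaly score that combines distance to the cluster center with local density. The specific choices here (an RBF kernel for first-order similarity, scikit-learn's AffinityPropagation for clustering, and a top-k mean similarity as the density estimate) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity, rbf_kernel


def sopd_scores(X, top_k_density=10):
    """Sketch of the SOPD idea: second-order similarity -> clustering ->
    center-distance scores damped by local density (higher = more anomalous)."""
    # First-order similarity between samples (assumed: RBF kernel).
    first_order = rbf_kernel(X)

    # Second-order proximity: two points are similar if their first-order
    # similarity profiles (matrix rows) are similar.
    second_order = cosine_similarity(first_order)

    # Cluster on the precomputed second-order similarity matrix
    # (AffinityPropagation is an assumption; the paper's clustering step may differ).
    labels = AffinityPropagation(affinity="precomputed",
                                 random_state=0).fit_predict(second_order)

    scores = np.zeros(len(X))
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        center = X[members].mean(axis=0)
        dist_to_center = np.linalg.norm(X[members] - center, axis=1)

        # Local density as the mean similarity to the top-k most similar points;
        # dense points near a cluster boundary get their scores damped.
        density = np.sort(second_order[members], axis=1)[:, -top_k_density:].mean(axis=1)
        scores[members] = dist_to_center / (density + 1e-12)
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),    # normal points
                   rng.uniform(-6, 6, (5, 2))])   # injected outliers
    print(sopd_scores(X).round(2))
```

Points with the largest scores are flagged as anomalies; in practice one would threshold the scores or take the top-n, as distance- and density-based detectors such as LOF typically do.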

Cite this article:

卢梦茹, 周昌军, 刘华文, 徐晓丹. Anomaly detection based on second-order proximity. 计算机系统应用 (Computer Systems & Applications), 2023, 32(2): 160–169.
History
  • Received: 2022-07-14
  • Revised: 2022-09-07
  • Published online: 2022-11-29