﻿ 基于人工蜂群优化的数据流聚类算法
 计算机系统应用  2020, Vol. 29 Issue (2): 145-150 PDF

Data Stream Clustering Algorithm Based on Artificial Bee Colony Optimization
JIA Dong-Li, SHEN Fei, CUI Xin-Yu
School of Information and Electrical Engineering, Hebei University of Engineering, Handan 05600038, China
Foundation item: Science and Technology Research Program of Higher Education, Hebei Province (ZD2015087); Science and Technology Research and Development Plan of Handan City (1721203049-1)
Abstract: In the traditional segmented data stream clustering algorithm, the inaccuracy of micro-cluster threshold radius T in the online part as well as the oversimplifying of the dealing process with the micro-cluster by the offline part leads to a low clustering quality. In order to break through such limitation, a data stream clustering algorithm on the basis of artificial bee colony optimization for offline part processing is proposed based on the existing dynamic sliding window model. This algorithm consists of two parts: (1) The online part dynamically adjusts the size of the window and improves the value of the micro-cluster threshold radius T according to the length of time that the data stays in the window so as to get micro clustering step by step. (2) The offline part uses the improved bee colony algorithm to continuously adjust dynamically to find the optimal clustering result. The experimental results show that this algorithm not only bears a high clustering quality, but also has fairly good ductility and stability.
Key words: data stream clustering     dynamic sliding window     artificial bee colony algorithm     micro cluster threshold radius

1 相关概念 1.1 动态滑动窗口模型

 图 1 滑动窗口模型

(1)在数据传送速度较慢时, 实际的数据流速度 ${v_t}$ 小于匀速数据流 $v$ , 即 $RW = W - \Delta w$ .

(2)在数据流的传送过程中实际速度 ${v_t}$ 与均匀速度 $v$ 相差无几, 即 $RW = W$ .

(3)在数据流的传送过程中实际的速度 ${v_t}$ 大于均匀速度 $v$ , 即 $RW = W + \Delta w$ .

1.2 微簇特征向量

 $Dis({X_a},C{F_b}) = \sqrt {\sum\limits_{i = 1}^d {{{\left(X_a^i - \frac{{F_b^i}}{n}\right)}^2}} }$ (1)

 $Dis(C{F_a},C{F_b}) = \sqrt {\sum\limits_{j = 1}^d {{{\left(\frac{{F_a^j}}{n} - \frac{{F_b^j}}{n}\right)}^2}} }$ (2)

1.3 微簇阈值半径T

(1)通过随机抽样的方法在样本数据集中选取适中规模的数据;

(2)对抽样样本两两随机为N对, 并计算每对数据间的距离;

(3)计算N对数据间的距离的期望EX和方差DX;

(4)构建阈值半径T, T=P×(EX+0.25×DX); 其中P通过统计得出取1/3时效果最优.

2 离线部分的数据流聚类优化 2.1 人工蜂群聚类优化算法思想

2.2 基本人工蜂群算法

2.3 改进的人工蜂群算法

(1)最大最小距离初始化

(2)适应度函数

 $fi{t_i} = \frac{{C{N_i}}}{{{J_i}}};i = 1,2,\cdots ,N$ (3)

(3)位置更新公式

 ${v_{id}} = {x_{id}} + \varphi ({x_{md}} - {x_{kd}}) + \theta ({x_{{\rm{best}},d}} - {x_{id}})$ (4)

2.4 离线过程中采用的计算概率
 ${P_i} = \frac{{fi{t_i}}}{{\displaystyle\sum\limits_{i = 1}^N {fi{t_i}} }};\;\;i = 1,2,\cdots,N$ (5)

3 人工蜂群优化的数据流聚类算法 3.1 在线算法

For数据集D的每个 $\scriptstyle{X_i}$

For初始每个聚类特征 $\scriptstyle C{F_k}$

根据式(1)计算 $\scriptstyle {X_i}$ $\scriptstyle C{F_k}$ 的距离, 并找出其最近的距离 $\scriptstyle Di{s_{\min }}({X_i},C{F_k})$

If( $\scriptstyle Di{s_{\min }}({X_i},C{F_k}) > {T}$ )

If(num>UB)

then{根据式(2)将距离最近的两个微簇合并; K←(K+1)}

Else{以 $\scriptstyle {X_i}$ 建立新的微簇, 并且更新微簇特征中的各项; K←(K+1)}

Else{根据式(1)把 $\scriptstyle {X_i}$ 加入与他距离最小的微簇 $\scriptstyle C{F_{\min }}$ 中}

If( $\scriptstyle {\rm{\delta}} RT-\Delta AT > {\rm{\sigma}}$ )

then { $\scriptstyle RW \leftarrow (W - \Delta w)$ } /*调整窗口大小*/

Else if( $\scriptstyle -{\rm{\sigma}} \le \Delta AT-{\rm{\delta}} RT\le {\rm{\sigma}}$ )

then { $\scriptstyle RW \leftarrow W$ }

Else { $\scriptstyle RW \leftarrow (W + \Delta w)$ }

End For

End For

3.2 离线算法

(1)对微簇集初始化并进行K-means计算得到聚类中心. (2)通过改进的蜂群算法对聚类中心进行迭代计算得到新的聚类中心并更新蜂群. (3)将K-means算法与改进的蜂群算法交替计算, 在最大迭代次数内求出最优聚类结果.

DO K-means

While(Cycle<=maxCycle) do

{Initialize( $\scriptstyle {v_{id}}$ ) /*按式(4)得到新位置 $\scriptstyle {v_{id}}$ */

计算 Maxfit( $\scriptstyle {v_i}$ ) /*根据式(3)计算并找到新蜜源的最大适应度值*/

Prob( $\scriptstyle {v_i}$ ) /*根据式(5)计算概率 $\scriptstyle {P_i}$ */

While(i<CZ/2) do

{If( $\scriptstyle {P_i}$ >rand(0, 1))

then { Initialize ( $\scriptstyle {v_{id}}$ ) }/*跟随蜂根据式(4)搜索新位置 $\scriptstyle {v_{id}}$ */

计算 Maxfit( $\scriptstyle {v_i}$ ) /*根据式(3)计算并找到新蜜源的最大适应度值*/

ii+1}

If(iter>Limit)

then { Initialize ( $\scriptstyle {v_{id}}$ ) }/*侦查蜂根据式(4)搜索新位置 $\scriptstyle {v_{id}}$ */

Else {iter←(iter+1)}

对邻域搜索到的点进行一次K-means聚类, 更新蜂群.

Cycle←(Cycle+1)}

Output聚类中心

3.3 算法分析

4 实验分析 4.1 实验数据与参数设置

4.2 实验结果与分析

 图 2 不同时间单元聚类纯度比较

 图 3 不同时间段聚类纯度比较

 图 4 第一时间段内聚类纯度对比

 图 5 不同维度数据运行时间对比

5 结论

 图 6 运行时间随簇个数的变化

 [1] Ding SF, Wu FL, Qian J, et al. Research on data stream clustering algorithms. Artificial Intelligence Review, 2015, 43(4): 593-600. DOI:10.1007/s10462-013-9398-7 [2] Aggarwal CC, Han JW, Wang JY, et al. A framework for clustering evolving data streams. Proceedings of the 29th International Conference on Very Large Data Bases. Berlin, Germany. 2003. 81–92. [3] 周华平, 陈顺生. 基于动态可调衰减滑动窗口的变速数据流聚类算法. 计算机应用与软件, 2015, 32(11): 255-260, 300. DOI:10.3969/j.issn.1000-386x.2015.11.059 [4] Fahy C, Yang SX, Gongora M. Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams. IEEE Transactions on Cybernetics, 2019, 49(6): 2215-2228. DOI:10.1109/TCYB.2018.2822552 [5] 肖裕权, 周肆清. 基于粒子群优化算法的数据流聚类算法. 计算机技术与发展, 2011, 21(10): 43-46, 50. DOI:10.3969/j.issn.1673-629X.2011.10.011 [6] 张忠平, 王浩, 薛伟, 等. 动态滑动窗口的数据流聚类方法. 计算机工程与应用, 2011, 47(7): 135-138. DOI:10.3778/j.issn.1002-8331.2011.07.039 [7] Karaboga D. An idea based on honey bee swarm for numerical optimization. Kayseri, Turkey: Erciyes University, 2005. [8] 周涓, 熊忠阳, 张玉芳, 等. 基于最大最小距离法的多中心聚类算法. 计算机应用, 2006, 26(6): 1425-1427. [9] 喻金平, 郑杰, 梅宏标. 基于改进人工蜂群算法的K均值聚类算法. 计算机应用, 2014, 34(4): 1065-1069, 1088. [10] 吴建胜, 张文鹏, 马垣. KDDCUP99数据集的数据分析研究. 计算机应用与软件, 2014, 31(11): 321-325. DOI:10.3969/j.issn.1000-386x.2014.11.081 [11] 廖江陵, 管有庆. 一种改进的模糊数据流聚类算法. 计算机技术与发展, 2017, 27(11): 96-100. DOI:10.3969/j.issn.1673-629X.2017.11.021