Abstract:In view of the problems that when processing massive data the traditional K-means is highly complex and insufficient in computation, a SKDk-means (Spark based kd-tree K-means) parallel clustering algorithm has been proposed. The algorithm improves the choice of initial center point by introducing kd-tree and overcomes the problem that the traditional K-means algorithm is easy to fall into the local optimal solution due to the uncertainty of the initial point. During K-means iterative calculation, the redundant computation has been reduced and clustering speed has been accelerated by the nearest neighbor search of kd-tree. The parallelization of the algorithm is realized on the spark platform and it is applied to the massive data clustering. Finally, the experimental results show that the algorithm has good accuracy and parallel computing performance.