Abstract:In view of the problems that traditional k-means clustering algorithm faces in dealing with mass data, such as running out of memory, the operating in slow speed and so on, this paper proposes a parallel k-means algorithm based on MapReduce. At the same time, in order to overcome the blindness of the k-means algorithm in terms of determining the initial value, we use the canopy algorithm to improve the insufficient. The experimental results show that the parallel k-means algorithm based on MapReduce has an effect on clustering before and after the improvement, not only the quality of the clustering has been increased, but in terms of processing large datasets. The speed-up ratio of the improved algorithm can get closer to the linear.