Abstract:In the field of initialization of clustering for large data set, random sampling is used as an important reduction operation. This paper focuses on the process and property of random sampling, and proposes a novel random sampling method which is based on KD-Tree samples. Sample spaces were further divided into several sub spaces using KD-Tree. KD-Tree samples were created for each sub-space. This overcomes the defect of skewness of the random samples. Thus the good initial centroids can well describe the clustering category of the whole data set. The experiment results show that the cluster initial centroids selected by the new method is more closed to the desired cluster centers, and the better clustering accuracy can be achieved.