Parallel Implementation of Improved K-means Algorithm Based on Spark

doi:10.15888/j.cnki.csa.006296

2018, Vol. 27, No. 4, pp. 151-156

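Authors:
SONG Dong-Fei (宋董飞), School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
XU Hua (徐华), School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China

Fund Project: Natural Science Foundation of Jiangsu Province (BK20140165); China Scholarship Council project (201308320030)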

Abstract:

To address the high computational complexity and insufficient computing capacity of the traditional K-means algorithm on massive data, this paper proposes SKDk-means (Spark based kd-tree K-means), a parallel clustering algorithm. The algorithm introduces a kd-tree to improve the selection of the initial center points, which overcomes the tendency of traditional K-means to fall into a local optimum because of its uncertain initial points. It also uses the nearest neighbor search of the kd-tree to reduce the distance computations performed in each K-means iteration, which speeds up clustering. The algorithm is parallelized on the Spark platform so that it can be applied to clustering massive data. Finally, experiments show that the algorithm achieves good accuracy and parallel computing performance.

Key words: kd-tree; Spark; K-means; parallelization; cloud computing
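
The abstract names two mechanisms, kd-tree nearest neighbor search inside the K-means iteration and parallel execution over partitioned data, without giving details. The following is only a minimal sketch of how one such iteration could look, not the authors' implementation. It assumes that the kd-tree is built over the current centers, that each point queries it for its nearest center, and that per-partition partial sums are merged in a reduce step. All function names (`build_kdtree`, `nearest`, `assign_partition`, `merge`, `kmeans_step`) are hypothetical, plain Python lists stand in for Spark partitions, and the kd-tree based choice of initial centers is not shown.

```python
from functools import reduce


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(items, depth=0):
    """Build a kd-tree over (index, coords) pairs; a node is a tuple or None."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))


def nearest(node, target, best=None):
    """Return (squared distance, index) of the stored point closest to target."""
    if node is None:
        return best
    (idx, coords), axis, left, right = node
    d = sq_dist(coords, target)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = target[axis] - coords[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, best)
    return best


def assign_partition(partition, tree, k, dim):
    """Map side: per-cluster coordinate sums and counts for one partition."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in partition:
        _, idx = nearest(tree, point)
        counts[idx] += 1
        for j in range(dim):
            sums[idx][j] += point[j]
    return sums, counts


def merge(a, b):
    """Reduce side: add two partial (sums, counts) results element-wise."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans_step(partitions, centers):
    """One K-means iteration over a list of partitions (assumed non-empty)."""
    k, dim = len(centers), len(centers[0])
    tree = build_kdtree(list(enumerate(centers)))
    partials = (assign_partition(p, tree, k, dim) for p in partitions)
    sums, counts = reduce(merge, partials)
    new_centers = []
    for i in range(k):
        if counts[i] == 0:
            new_centers.append(tuple(centers[i]))  # keep an empty cluster's center
        else:
            new_centers.append(tuple(s / counts[i] for s in sums[i]))
    return new_centers


partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_step(partitions, centers))
```

On a real cluster, the per-partition step and the merge would plausibly correspond to PySpark's `RDD.mapPartitions` and `RDD.reduce`, with the centers (or the tree built from them) shared through `SparkContext.broadcast`; that mapping is an assumption about one reasonable design, not something the abstract states.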

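Cite this article

SONG Dong-Fei, XU Hua. Parallel Implementation of Improved K-means Algorithm Based on Spark. Computer Systems & Applications (计算机系统应用), 2018, 27(4): 151-156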

History

Received: 2017-07-23
Last revised: 2017-08-09
Published online: 2018-04-03
