Similar Duplicate Record Detection of Massive Data Based on Partition

doi:10.15888/j.cnki.csa.006835

2019, Vol. 28, No. 3, pp. 172-178


Similar Duplicate Record Detection of Massive Data Based on Partition


Abstract:

Social engineering databases store massive volumes of data that suffer from redundancy and low query efficiency. To address these quality problems, this paper proposes an effective partition-based sorted-neighborhood (neighbor sorting) algorithm. Social engineering data collected through different channels and held in different storage formats is first integrated into a massive data set that can be stored as a two-dimensional table. Following the partitioning idea, this large data set is then split into clusters, and an improved sorted-neighborhood algorithm is applied to the small data set in each cluster to produce the final similar duplicate record detection results. Experiments and comparative analysis show that combining partitioning with the sorted-neighborhood algorithm improves the time efficiency of similar duplicate record detection on massive data and also improves detection accuracy.
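The two-stage idea in the abstract (partition first, then run neighbor sorting inside each cluster) can be illustrated with a minimal sketch. The abstract does not specify the paper's partitioning rule, its sorting key, its similarity measure, or what the "improved" variant changes, so everything below is an assumption for illustration: the key is a concatenation of normalized fields, the partition is a hypothetical key-prefix grouping, similarity comes from Python's `difflib.SequenceMatcher`, and the window scan is the basic sorted-neighborhood method rather than the authors' improved version.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def sort_key(record):
    # Hypothetical sorting key: every field stripped, lowercased, concatenated.
    return "".join(str(value).strip().lower() for value in record)


def partition(records, prefix_len=1):
    # Assumed partitioning rule (not from the paper): group record indices
    # by the first `prefix_len` characters of their sorting key.
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[sort_key(rec)[:prefix_len]].append(idx)
    return clusters


def similar(a, b, threshold):
    # Assumed similarity measure: difflib ratio over the two sorting keys.
    return SequenceMatcher(None, sort_key(a), sort_key(b)).ratio() >= threshold


def detect_duplicates(records, window=3, threshold=0.85, prefix_len=1):
    # Basic sorted-neighborhood scan, run independently inside each cluster:
    # sort the cluster by key, then compare each record only with the
    # next (window - 1) records in sorted order.
    pairs = set()
    for members in partition(records, prefix_len).values():
        members.sort(key=lambda i: sort_key(records[i]))
        for pos, i in enumerate(members):
            for j in members[pos + 1 : pos + window]:
                if similar(records[i], records[j], threshold):
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Made-up sample records (name, email) used purely for illustration.
records = [
    ("zhang wei", "zhangwei@example.com"),
    ("li na", "lina@example.com"),
    ("Zhang Wei", "zhangwei@example.com "),
    ("li na", "lina@exmaple.com"),
    ("wang fang", "wangfang@example.com"),
]
print(detect_duplicates(records))  # [(0, 2), (1, 3)]
```

Because records are only compared within their own cluster and within a small window, the number of pairwise comparisons stays far below the all-pairs count, which is the source of the time savings the abstract describes. The trade-off in this simplified version is that near-duplicates whose keys differ in the first character land in different clusters and are never compared.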


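Cite this article

Li Li (李莉), Zhang Xiaowen (张晓雯). Similar Duplicate Record Detection of Massive Data Based on Partition. Computer Systems & Applications (计算机系统应用), 2019, 28(3): 172-178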


History

Received: 2018-10-04
Last revised: 2018-10-23
Published online: 2019-02-22
