Computer Systems & Applications, 2019, 28(3): 172-178
Similar Duplicate Record Detection of Massive Data Based on Partition
(School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China)
Received: October 4, 2018    Revised: October 23, 2018
Abstract: To address the data redundancy and low query efficiency of the massive data held in social engineering databases (社工库), this study proposes a partition-based sorted-neighborhood (neighbor sorting) algorithm. Social engineering data collected through different channels and stored in different formats are first integrated into a massive data set that can be stored as two-dimensional tables. The data set is then partitioned into clusters, and an improved sorted-neighborhood algorithm is run on the small data set in each cluster to produce the final similar duplicate record detection results. Experiments and comparative analysis show that combining partitioning with the sorted-neighborhood algorithm improves the time efficiency of similar duplicate record detection on massive data and also raises detection accuracy.
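The two-stage idea in the abstract (partition the data set into clusters, then run neighbor sorting, i.e. a sorted-neighborhood pass, inside each cluster) can be illustrated with a minimal sketch. This is not the authors' improved algorithm: the abstract does not specify the partition key, the sort key, the similarity measure, the window size, or the threshold, so `block_key`, `sort_key`, `similarity`, `window`, and `threshold` below are all hypothetical choices, and the sample records are invented for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def partition(records, block_key):
    """Split the data set into clusters of record indices (block_key is an assumption)."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[block_key(rec)].append(idx)
    return clusters


def similarity(a, b):
    """Assumed measure: mean per-field string similarity in [0, 1]."""
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def sorted_neighborhood(records, indices, sort_key, window=3, threshold=0.85):
    """Sort one cluster, then compare each record with the next window-1 records."""
    ordered = sorted(indices, key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, block_key, sort_key, window=3, threshold=0.85):
    """Run the windowed comparison cluster by cluster and merge the results."""
    pairs = set()
    for indices in partition(records, block_key).values():
        pairs |= sorted_neighborhood(records, indices, sort_key, window, threshold)
    return pairs


# Invented sample rows: (name, email, phone)
records = [
    ("zhang wei", "zhangwei@example.com", "13800000001"),
    ("zhang  wei", "zhangwei@example.com", "13800000001"),
    ("li na", "lina@example.com", "13900000002"),
    ("zhang min", "zmin@example.org", "13700000003"),
]
pairs = detect_duplicates(
    records,
    block_key=lambda r: r[0][:1],  # hypothetical: partition on first letter of name
    sort_key=lambda r: r[1],       # hypothetical: sort each cluster by email
)
print(pairs)
```

Because comparisons happen only inside a cluster and only within a sliding window, the number of pairwise comparisons grows roughly linearly with cluster size rather than quadratically with the full data set, which is the time-efficiency argument the abstract makes. The trade-off is that duplicates placed in different clusters by a poor partition key are never compared.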
Citation:
LI Li, ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition. Computer Systems & Applications, 2019, 28(3): 172-178