计算机系统应用 (COMPUTER SYSTEMS APPLICATIONS), 2019, 28(3): 172-178

基于划分的海量数据相似重复记录检测

李莉, 张晓雯

(江苏大学计算机科学与通信工程学院, 镇江 212013)

Similar Duplicate Record Detection of Massive Data Based on Partition

LI Li, ZHANG Xiao-Wen

(School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China)

Received: October 04, 2018; Revised: October 23, 2018

中文摘要: 针对目前社工库存储的海量数据，数据冗余、查询效率低下的质量问题，本文提出了一种有效的基于划分的近邻排序算法.对不同渠道采集、以不同存储方式存储的社工数据进行整合形成能以二维表形式存储的海量数据集，采用划分思想，对大数据集进行分割，形成簇；采用改进的近邻排序算法对各个簇中的小数据集进行检测得到最终的相似重复记录检测结果.实验和对比分析结果表明，划分和近邻排序算法的结合使用不仅提高了海量数据相似重复记录检测的时间效率，检测准确率也有所提升.

中文关键词: 数据质量; 数据清洗; 相似重复记录; 划分; SNM算法

Abstract: To address the data redundancy and low query efficiency caused by the massive data stored in social engineering databases, this paper proposes an effective partition-based sorted-neighborhood method (SNM). Social engineering data collected through different channels and held in different storage formats are first integrated into a massive data set that can be stored as two-dimensional tables. The large data set is then partitioned into clusters, and an improved SNM algorithm is applied to the small data set in each cluster to produce the final similar duplicate record detection results. Experiments and comparative analysis show that combining partitioning with SNM improves both the time efficiency and the accuracy of similar duplicate record detection on massive data.

Keywords: data quality; data cleaning; similar duplicate records; partition; SNM algorithm
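
The two-stage idea summarized in the abstract (partition the large data set into clusters, then run a sorted-neighborhood pass inside each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the partitioning rule or the improvements made to SNM, so the blocking key, the sort key, the window size, the threshold, and the `difflib`-based similarity measure below are all assumptions chosen for illustration, and the sample records are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean per-field string similarity in [0, 1] (an assumed measure)."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def partition(records, block_key):
    """Split record indices into clusters by a hypothetical blocking key."""
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[block_key(rec)].append(idx)
    return clusters


def snm(records, indices, sort_key, window=3, threshold=0.85):
    """Basic (unimproved) sorted-neighborhood pass over one cluster."""
    ordered = sorted(indices, key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        # Compare each record only with the next window-1 records in sort order.
        for j in ordered[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, block_key, sort_key, window=3, threshold=0.85):
    """Run SNM inside every cluster and merge the detected pairs."""
    pairs = set()
    for indices in partition(records, block_key).values():
        pairs |= snm(records, indices, sort_key, window, threshold)
    return pairs


# Made-up rows of a two-dimensional table: (name, e-mail, city).
records = [
    ("zhang wei", "zhangwei@example.com", "zhenjiang"),
    ("zhang  wei", "zhangwei@example.com", "zhenjiang"),
    ("li na", "lina@example.com", "nanjing"),
    ("li na", "lina@example.com", "nanjing"),
    ("wang fang", "wangfang@example.com", "zhenjiang"),
]
duplicates = detect_duplicates(
    records,
    block_key=lambda r: r[2],  # assumed: partition by city
    sort_key=lambda r: r[1],   # assumed: sort by e-mail inside each cluster
)
print(sorted(duplicates))  # [(0, 1), (2, 3)]
```

Because comparisons happen only inside a cluster and only within the sliding window, the number of pairwise comparisons grows roughly with cluster size times window size rather than with the square of the full data set, which is the intuition behind the time savings the abstract reports.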

Author Name	Affiliation
LI Li	School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
ZHANG Xiao-Wen	School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Citation:
李莉, 张晓雯. 基于划分的海量数据相似重复记录检测. 计算机系统应用, 2019, 28(3): 172-178
LI Li, ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition. COMPUTER SYSTEMS APPLICATIONS, 2019, 28(3): 172-178