Similar Duplicate Record Detection of Massive Data Based on Partition

doi:10.15888/j.cnki.csa.006835

Computer Systems & Applications, Volume 28, Issue 3, 2019, pp. 172-178

Author:
LI Li
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
ZHANG Xiao-Wen
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China

Abstract:

To address data redundancy and low query efficiency in the storage of massive social work data, this study proposes a partition-based sorted neighborhood method (SNM). Social data collected through different channels and held in different storage formats are first integrated into a single massive data set that can be stored in two-dimensional form. The data set is then partitioned into clusters, and an improved SNM algorithm is applied to each cluster to obtain the final similar duplicate record detection results. Experimental and comparative analysis shows that combining partitioning with the SNM algorithm improves both the time efficiency and the accuracy of similar duplicate record detection on massive data.
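
The abstract does not give the details of the improved algorithm, so the following is only a minimal sketch of the general idea of partitioning followed by a sliding-window comparison inside each partition. The function names, the partition and sort keys, the window size, the threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions, not the authors' method.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Mean per-field string similarity in [0, 1] (an assumed measure)."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def detect_duplicates(records, partition_key, sort_key, window=3, threshold=0.85):
    """Partition the records, then run a sliding-window comparison per partition."""
    # Step 1: split the data set into clusters using the partition key.
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[partition_key(rec)].append(idx)

    pairs = set()
    for members in clusters.values():
        # Step 2: sort each cluster so likely duplicates end up adjacent.
        members.sort(key=lambda i: sort_key(records[i]))
        # Step 3: compare each record only with the next window-1 records.
        for pos, i in enumerate(members):
            for j in members[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: (name, city, phone)
records = [
    ("Zhang Wei", "Zhenjiang", "13800000001"),
    ("Zhang  Wei", "Zhenjiang", "13800000001"),
    ("Li Na", "Zhenjiang", "13900000002"),
    ("Zhang Wei", "Nanjing", "13700000003"),
]
duplicates = detect_duplicates(
    records,
    partition_key=lambda r: r[1],     # assumption: partition by city
    sort_key=lambda r: r[0].lower(),  # assumption: sort by name
)
print(duplicates)  # prints [(0, 1)]
```

In this sketch, records in different partitions are never compared, which is where the time saving would come from; the trade-off is that a poorly chosen partition key can separate true duplicates, so the choice of key matters.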

Key words: data quality; data cleaning; similar duplicate records; partition; SNM algorithm

Citation:

LI Li, ZHANG Xiao-Wen. Similar Duplicate Record Detection of Massive Data Based on Partition. Computer Systems & Applications, 2019, 28(3): 172-178

History

Received: October 4, 2018
Revised: October 23, 2018
Online: February 22, 2019
