Duplicate Removal Algorithm for Public Opinion

    Abstract:

    In big data services, removing duplicate public opinion information is unavoidable, yet the task lacks theoretical guidance. This work studies classical duplicate removal algorithms (SimHash, MinHash, Jaccard, and Cosine Similarity) together with common word segmentation and feature selection algorithms, in order to find the best-performing approach. A Jaccard variant for short articles and a SimHash variant based on cosine distance are proposed as improvements over the traditional algorithms. Because experiments over so many research subjects are inefficient, a staged strategy is adopted: a vertical comparison first selects the algorithms with clear advantages, a horizontal comparison then finds the most appropriate algorithm combination, and a comprehensive comparison is made last. Experiments on 3000 public opinion samples show that the improved SimHash outperforms traditional SimHash, and that the improved Jaccard increases the recall rate by 17% and improves efficiency by 50% compared with traditional Jaccard. With accuracy held above 96%, MinHash with Jieba full-mode word segmentation and Jaccard with IKAnalyzer intelligent word segmentation both achieve a recall rate above 75% with good stability. MinHash is slightly weaker than Jaccard in removal effect, yet it has the best overall performance and a shorter feature comparison time.
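For readers unfamiliar with the measures named in the abstract, the following is a minimal sketch of the textbook forms of Jaccard similarity, MinHash, and SimHash. It does not reproduce the improved variants proposed in the article, whose details are not given here. The function names, the regex tokenizer (a stand-in for a real segmenter such as Jieba or IKAnalyzer), the BLAKE2b-based hash, and the signature sizes are all assumptions chosen for illustration.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a stand-in for a real word segmenter."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B| over token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def _hash64(token, seed=0):
    """Deterministic 64-bit hash of a token, salted by seed (assumed scheme)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(toks, num_perm=128):
    """Textbook MinHash: for each seed, keep the minimum hash over the token set."""
    unique = set(toks)
    if not unique:
        raise ValueError("cannot build a MinHash signature for an empty document")
    return [min(_hash64(t, seed) for t in unique) for seed in range(num_perm)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots; an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def simhash(toks, bits=64):
    """Textbook SimHash with unit weight per token occurrence."""
    counts = [0] * bits
    for t in toks:
        h = _hash64(t)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the weighted vote for it is positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    doc_a = tokens("Heavy rain floods the city centre on Monday")
    doc_b = tokens("Heavy rain floods the city centre on Tuesday")
    print("Jaccard:", jaccard(doc_a, doc_b))
    print("MinHash estimate:",
          minhash_similarity(minhash_signature(doc_a), minhash_signature(doc_b)))
    print("SimHash Hamming distance:", hamming(simhash(doc_a), simhash(doc_b)))
```

In a sketch like this, exact Jaccard compares full token sets, while MinHash and SimHash compare fixed-size signatures, which is one plausible reason a signature-based method would show a shorter feature comparison time. Any duplicate threshold on the similarity or Hamming distance would have to be tuned on real data.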

Get Citation

Zhang Qingmei. Duplicate Removal Algorithm for Public Opinion. Computer Systems & Applications, 2017, 26(5): 16-22

History
  • Received: August 28, 2016
  • Revised: September 27, 2016
  • Online: May 13, 2017