Duplicate Removal Algorithm for Public Opinion Data
    Abstract:

    Duplicate removal of public opinion data is unavoidable in big data services, yet it lacks theoretical guidance. This paper studies the classical duplicate removal algorithms SimHash, MinHash, Jaccard, and Cosine Similarity, together with common word segmentation and feature selection algorithms, in order to find the best-performing combinations. Two improvements to the traditional algorithms are proposed: a Jaccard variant for short articles and a SimHash variant based on Cosine Distance. Because comparing so many candidates makes experiments inefficient, a three-stage strategy is adopted: a vertical comparison first filters out the algorithms with clear advantages, a horizontal comparison then finds the best combinations, and a comprehensive comparison is made last. Experiments on 3000 public opinion samples show that the improved SimHash achieves higher precision and recall than traditional SimHash, and that the improved Jaccard raises recall by 17% and efficiency by 50% over traditional Jaccard. With precision held above 96%, both MinHash with Jieba full-mode segmentation and Jaccard with IKAnalyzer smart segmentation reach recall above 75% with good stability. MinHash is slightly weaker than Jaccard in duplicate removal effect, but its feature comparison time is shorter and its overall performance is the best.
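
For readers unfamiliar with the measures compared in the abstract, the following is a minimal sketch of the textbook forms of Jaccard similarity, MinHash, and SimHash. It is not the paper's improved short-article Jaccard or Cosine Distance SimHash, whose details the abstract does not give. Several choices here are assumptions made for illustration only: character 2-gram shingles stand in for a real segmenter such as Jieba or IKAnalyzer, seeded MD5 hashes stand in for random permutations, and all function names are hypothetical.

```python
import hashlib


def shingles(text, k=2):
    """Character k-grams; an assumed stand-in for real word segmentation."""
    text = "".join(text.split())  # drop whitespace
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(token, seed):
    """64-bit hash of a token under a given seed (first 8 bytes of MD5)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(features, num_perm=128):
    """One minimum hash value per seed; `features` must be non-empty."""
    return [min(_hash64(f, seed) for f in features) for seed in range(num_perm)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def simhash(features, bits=64):
    """Unweighted SimHash: per-bit vote over the hashes of all features."""
    votes = [0] * bits
    for f in features:
        h = _hash64(f, 0)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")
```

In a sketch like this, two documents would be treated as duplicates when `jaccard` or `minhash_similarity` exceeds a chosen threshold, or when the `hamming` distance between their `simhash` fingerprints falls below one. Comparing fixed-length signatures rather than full feature sets is what gives MinHash the shorter feature comparison time noted in the abstract; the thresholds themselves are not stated there and would need to be tuned on labelled data.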

Cite this article

Zhang Qingmei (张庆梅). Duplicate Removal Algorithm for Public Opinion Data. Computer Systems & Applications, 2017, 26(5): 16-22

History
  • Received: 2016-08-28
  • Last revised: 2016-09-27
  • Published online: 2017-05-13