Incremental Text Clustering Algorithm for Hot Topic Detection

Fund Project: National Social Science Fund of China (18XYY010)

Abstract:

The traditional Single-Pass clustering algorithm is highly sensitive to the order in which data arrive and has low accuracy. To address these problems, this paper proposes an incremental text clustering algorithm (SP-HTD) that works at subtopic granularity and accounts for the dynamics, timeliness, and contextual semantic features of news texts. First, building on the LDA2Vec topic model, document vectors and word vectors are trained jointly to obtain context vectors, which capture the semantic features of the text and their importance relationships. Then, on top of the Single-Pass algorithm, subtopics are divided according to the extracted hot-topic feature words, a time threshold is set to check the timeliness of each cluster center, and the mined semantic features are combined with the task to update the cluster centers dynamically. Finally, with temporal characteristics as an auxiliary signal, the topic centroid vectors are updated to improve the accuracy of text similarity computation. The results show that the F-measure of the proposed method reaches up to 89.3% and that, while maintaining clustering accuracy, its miss rate and false detection rate are markedly lower than those of the traditional algorithm, so it can effectively improve the accuracy of topic detection.
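The abstract gives no pseudocode, so the following is only a minimal sketch of a generic Single-Pass loop with a time threshold and a running-mean centroid update, written to illustrate the kind of procedure described. It is not the authors' SP-HTD implementation. The names `Cluster`, `single_pass`, `sim_threshold`, and `time_window`, the default values, the use of cosine similarity, and the rule that stale clusters are simply skipped are all assumptions. The input vectors stand in for whatever document representation (for example, LDA2Vec context vectors) is used upstream.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    centroid: list     # running mean of the member vectors
    size: int          # number of documents assigned so far
    last_seen: float   # timestamp of the most recent member


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, sim_threshold=0.8, time_window=3.0):
    """Cluster (timestamp, vector) pairs in arrival order.

    Returns (labels, clusters). Both thresholds are hypothetical defaults.
    """
    clusters, labels = [], []
    for ts, vec in docs:
        best, best_sim = None, -1.0
        for idx, c in enumerate(clusters):
            # Assumed timeliness rule: ignore centers not updated recently.
            if ts - c.last_seen > time_window:
                continue
            s = cosine(vec, c.centroid)
            if s > best_sim:
                best, best_sim = idx, s
        if best is not None and best_sim >= sim_threshold:
            c = clusters[best]
            # Incremental mean: fold the new vector into the centroid.
            c.centroid = [(m * c.size + x) / (c.size + 1)
                          for m, x in zip(c.centroid, vec)]
            c.size += 1
            c.last_seen = ts
            labels.append(best)
        else:
            # No fresh, sufficiently similar cluster: open a new one.
            clusters.append(Cluster(list(vec), 1, ts))
            labels.append(len(clusters) - 1)
    return labels, clusters
```

Because each document is compared only against centers updated within `time_window`, a late report that resembles an old topic opens a new cluster instead of being merged into a stale one. This is one simple way to read the "timeliness of the cluster center" idea; the paper's actual rule may differ.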

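Cite this article

Guo Ying, Xue Tao, Hu Weihua. Incremental Text Clustering Algorithm for Hot Topic Detection. Computer Systems & Applications, 2022, 31(9): 280-286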

History
  • Received: 2021-12-07
  • Last revised: 2022-01-04
  • Published online: 2022-07-07