Received: December 7, 2021 Revised: January 4, 2022
Abstract: The traditional Single-Pass clustering algorithm is overly sensitive to the order in which data arrive and has low accuracy. To address this, an incremental text clustering algorithm (SP-HTD) is proposed that works at subtopic granularity and accounts for the dynamics, timeliness, and contextual semantic features of news texts. First, based on the LDA2Vec topic model, document vectors and word vectors are trained jointly to obtain context vectors, which captures the semantic features of the text and the relative importance of its terms. Then, building on the Single-Pass algorithm, subtopics are divided according to the extracted hot-topic feature words, and a time threshold is set to check whether each cluster center is still current; the mined semantic features are combined with the task to update the cluster centers dynamically. Finally, with time characteristics as an auxiliary signal, the topic centroid vectors are updated to improve the accuracy of text similarity calculation. The results show that the F value of the proposed method reaches up to 89.3% and that, while maintaining clustering accuracy, the method achieves markedly lower miss and false detection rates than the traditional algorithm, effectively improving the accuracy of topic detection.
Keywords: Single-Pass; text representation; text clustering; text similarity; hot topic detection
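The incremental procedure outlined in the abstract (assign each incoming text to the most similar cluster, skip cluster centers that a time threshold marks as stale, then update the centroid) can be illustrated with a minimal sketch. This is not the authors' SP-HTD implementation: it assumes documents already arrive as plain numeric vectors rather than LDA2Vec context vectors, uses a simple running-mean centroid update, and omits subtopic division. The names `Cluster`, `single_pass`, `sim_threshold`, and `time_window` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    centroid: List[float]          # running mean of member vectors
    last_seen: float               # timestamp of the most recent member
    size: int = 1
    members: List[str] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, sim_threshold=0.8, time_window=3.0):
    """Cluster (doc_id, vector, timestamp) triples in arrival order.

    sim_threshold and time_window are illustrative values, not taken
    from the paper.
    """
    clusters = []
    for doc_id, vec, ts in docs:
        best, best_sim = None, -1.0
        for c in clusters:
            # Time threshold: a cluster with no recent member is treated
            # as expired and is not a candidate for this document.
            if ts - c.last_seen > time_window:
                continue
            s = cosine(vec, c.centroid)
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= sim_threshold:
            # Update the centroid as a running mean and refresh the timestamp.
            n = best.size
            best.centroid = [(m * n + x) / (n + 1)
                             for m, x in zip(best.centroid, vec)]
            best.size = n + 1
            best.last_seen = ts
            best.members.append(doc_id)
        else:
            # No sufficiently similar, still-current cluster: start a new one.
            clusters.append(Cluster(list(vec), ts, 1, [doc_id]))
    return clusters
```

With these assumed settings, two similar documents arriving close together share a cluster, while a document that matches an old cluster's centroid but arrives after the time window starts a new cluster instead of joining the expired one.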
Funding: National Social Science Fund of China (18XYY010)
Citation:
GUO Ying, XUE Tao, HU Wei-Hua. Incremental Text Clustering Algorithm for Hot Topic Detection. COMPUTER SYSTEMS APPLICATIONS (计算机系统应用), 2022, 31(9): 280-286