面向热点话题检测的增量文本聚类算法

doi:10.15888/j.cnki.csa.008677

AIPUB归智期刊联盟

微信公众号

网站二维码

首页 > 过刊浏览>2022年第31卷第9期 >280-286. DOI:10.15888/j.cnki.csa.008677

PDF HTML阅读 XML下载导出引用引用提醒

面向热点话题检测的增量文本聚类算法
DOI:
                        10.15888/j.cnki.csa.008677
                    
CSTR:
                        
                    
作者:
                        
                        
                    
作者单位:
作者简介:
通讯作者:
中图分类号:
基金项目:国家社会科学基金(18XYY010)

Incremental Text Clustering Algorithm for Hot Topic Detection

Author:

Affiliation:

Fund Project:

摘要

图/表

访问统计

参考文献

相似文献

引证文献

资源附件

文章评论

摘要:

针对传统的Single-Pass聚类算法对数据输入顺序过于敏感和准确率较低的问题, 提出一种以子话题为粒度, 考虑新闻文本动态性、时效性和上下文语义特征的增量文本聚类算法(SP-HTD). 首先通过解析LDA2Vec主题模型, 联合训练文档向量和词向量, 获得上下文向量, 充分挖掘文本的语义特征及重要性关系. 然后在Single-Pass算法基础上, 根据提取到的热点主题特征词, 划分子话题, 并设置时间阈值, 来确认类簇中心的时效性, 将挖掘的语义特征和任务相结合, 动态更新类簇中心. 最后以时间特性为辅, 更新话题质心向量, 提高文本相似度计算的准确性. 结果表明, 所提方法的F值最高可达89.3%, 且在保证聚类精度的前提下, 在漏检率和误检率上较传统算法有明显改善, 能够有效提高话题检测的准确性.

Abstract:

As the traditional Single-Pass clustering algorithm is highly sensitive to the input sequence of data and has low accuracy, an incremental text clustering algorithm (SP-HTD) is proposed, which takes subtopics as granularity and considers the dynamics, timeliness, and contextual semantic features of news texts. Firstly, by parsing the LDA2Vec topic model, this study jointly trains the document vectors and the word vectors to obtain the context vectors and thus fully mines the semantic features and importance relationship of the text. Then, on the basis of the Single-Pass algorithm, sub-topics are classified according to the extracted hot topic feature words, and the time threshold is set to confirm the timeliness of the cluster center. The mined semantic features and tasks are combined to dynamically update the cluster center. Finally, with the assistance of the time characteristics, the centroid vectors of the topics are updated to improve the accuracy of text similarity calculation. The results reveal that the F value of the proposed method can reach up to 89.3%, and on the premise of ensuring the clustering accuracy, the proposed method has a significantly lower undetected rate and false detection rate compared with those of the traditional algorithm, and thus it can effectively improve the accuracy of topic detection.

参考文献

相似文献

引证文献

引用本文

郭莹,薛涛,胡伟华.面向热点话题检测的增量文本聚类算法.计算机系统应用,2022,31(9):280-286

复制

文章指标

点击次数:
下载次数:
HTML阅读次数:
引用次数:

历史

收稿日期:2021-12-07
最后修改日期:2022-01-04
录用日期:
在线发布日期: 2022-07-07
出版日期:

微信公众号

网站二维码

引用本文

分享

相关视频

文章指标

历史

文章二维码