Received: October 15, 2018  Revised: October 31, 2018
Abstract: Short text clustering is a long-standing topic in information extraction. Large-scale short text data exhibit a "long tail phenomenon": when traditional algorithms cluster such data, they face high feature dimensionality and lose the information of small classes. To address these problems, this study proposes a Frequent Itemsets collaborative Pruning iteration Clustering framework (FIPC). The algorithm combines an iterative clustering framework with the K-medoids algorithm and applies a collaborative pruning strategy to cluster the texts of small classes. Experimental results show that FIPC effectively improves the clustering precision for small-class short texts and avoids the problem of overlapping clusters.
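The two building blocks named in the abstract, frequent items and K-medoids, can be illustrated with a minimal Python sketch. This is not the FIPC algorithm itself: the collaborative pruning strategy and the iterative framework are not specified in this abstract and are not reproduced here. The function names (`frequent_terms`, `jaccard_distance`, `k_medoids`), the use of single frequent terms rather than larger itemsets, the Jaccard distance, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def frequent_terms(docs, min_support):
    """Return the terms that occur in at least `min_support` documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in doc_freq.items() if count >= min_support}


def jaccard_distance(a, b):
    """Jaccard distance between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def assign(items, medoids):
    """Label each item with the position of its nearest medoid."""
    return [
        min(range(len(medoids)),
            key=lambda j: jaccard_distance(x, items[medoids[j]]))
        for x in items
    ]


def k_medoids(items, init, max_iter=100):
    """Alternating K-medoids; `init` lists the starting medoid indices."""
    medoids = list(init)
    for _ in range(max_iter):
        labels = assign(items, medoids)
        new_medoids = []
        for j, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(old)
                continue
            # The new medoid minimises total distance to its cluster members.
            new_medoids.append(min(
                members,
                key=lambda c: sum(jaccard_distance(items[c], items[m])
                                  for m in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(items, medoids)


# Hypothetical toy corpus: three travel texts and two agriculture texts.
docs = [
    ["cheap", "flight", "ticket"],
    ["flight", "ticket", "deal"],
    ["cheap", "flight", "deal"],
    ["rice", "pest", "control"],
    ["rice", "pest", "spray"],
]
vocab = frequent_terms(docs, min_support=2)   # drops rare terms
items = [set(doc) & vocab for doc in docs]    # reduced representation
medoids, labels = k_medoids(items, init=[0, 3])
print(medoids, labels)
```

Filtering to frequent terms shrinks the feature space before clustering, and because each medoid is an actual document, every text is assigned to exactly one cluster. In this toy run the three travel texts and the two agriculture texts end up in separate clusters.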
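Funding: Sub-project of the National Science and Technology Support Program of China (2015BAD29B01); Soft Science Research Project of the Ministry of Agriculture (D201721); Fundamental Research Funds for the Central Universities (CZY18016)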
Citation:
SONG Zhong-Shan, ZHANG Guang-Kai, YIN Fan, TIE Jun. Long Tail Text Clustering Algorithm Based on Frequent Patterns. COMPUTER SYSTEMS APPLICATIONS, 2019, 28(4): 139-144