Computer Systems Applications, 2022, 31(2): 267-272
Clustering Short Text Classification Based on Fusion of BERT and GSDMM
(1. Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei 230041, China; 2. International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230041, China)
Received: April 06, 2021; Revised: April 30, 2021
Abstract: Short texts have sparse features and irregular wording, which limits the effectiveness of traditional natural language processing methods in short text classification. To address these characteristics, this study proposes a short text classification algorithm that fuses bidirectional encoder representations from Transformers (BERT) with a collapsed Gibbs sampling algorithm for the Dirichlet multinomial mixture model (GSDMM) and adds clustering guidance, with the aim of improving the effectiveness and accuracy of short text classification. First, a BERT-GSDMM fusion model converts each short text into an integrated semantic vector that reflects both global semantic features and topic features, which addresses the sparsity of short text features and the lack of topic information. Second, a clustering guidance algorithm is introduced in the training stage ahead of the classifier to expand the labeled data, which also improves the interpretability of the results. Finally, the expanded labeled data set is used to train the classifier, which then classifies short texts automatically. Negative reviews from an e-commerce platform serve as the validation data set, and multiple groups of comparative experiments verify the effectiveness and advantages of the algorithm for short text classification.
Keywords: GSDMM; BERT; SVM; short text classification; clustering guidance; semantic vector
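The abstract names two steps without giving their details: fusing a BERT semantic vector with GSDMM topic features, and using clusters to expand the labeled data. The following is a minimal Python sketch of one way these steps could look, not the authors' implementation. It assumes that fusion is a plain concatenation of an L2-normalized sentence embedding with a per-text topic distribution, and that clustering guidance propagates the majority seed label within each cluster. The function names `fuse` and `expand_labels` and the parameters `topic_weight` and `min_purity` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(semantic_vec, topic_dist, topic_weight=1.0):
    """Assumed fusion: normalized semantic vector followed by weighted topic features.

    semantic_vec: sentence embedding (for example, produced by a BERT encoder)
    topic_dist:   per-text topic probabilities (for example, produced by GSDMM)
    """
    return l2_normalize(semantic_vec) + [topic_weight * p for p in topic_dist]


def expand_labels(cluster_ids, seed_labels, min_purity=0.8):
    """Assumed clustering guidance: spread each cluster's majority seed label.

    cluster_ids: cluster id of every text, indexed by text position
    seed_labels: dict mapping text position -> manually assigned label
    min_purity:  minimum share of seed votes the majority label must hold
    """
    # Count the seed labels that fall inside each cluster.
    votes = defaultdict(Counter)
    for idx, label in seed_labels.items():
        votes[cluster_ids[idx]][label] += 1

    expanded = dict(seed_labels)
    for idx, cid in enumerate(cluster_ids):
        if idx in expanded or cid not in votes:
            continue  # already labeled, or the cluster has no seed labels
        label, count = votes[cid].most_common(1)[0]
        if count / sum(votes[cid].values()) >= min_purity:
            expanded[idx] = label
    return expanded


# Toy usage: six texts in three clusters, two of them labeled by hand.
cluster_ids = [0, 0, 0, 1, 1, 2]
seed_labels = {0: "logistics", 3: "quality"}
expanded = expand_labels(cluster_ids, seed_labels)
fused = fuse([3.0, 4.0], [0.25, 0.75])
```

In this sketch, texts in a cluster without any seed label stay unlabeled, and the `min_purity` threshold keeps mixed clusters from spreading a weak majority. The fused vectors and expanded labels would then be passed to a classifier; the keywords list SVM.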
Funding: Youth Program of the Anhui Provincial Natural Science Foundation (1908085AG299)
Cite as:
LIU Hao, WANG Yu-Chen. Clustering Short Text Classification Based on Fusion of BERT and GSDMM. Computer Systems Applications, 2022, 31(2): 267-272