Abstract: Traditional natural language processing methods perform poorly on short text classification because short texts have sparse features and irregular wording. To account for these characteristics, this study proposes a classification algorithm that combines a fusion of bidirectional encoder representations from Transformers (BERT) and a collapsed Gibbs sampling algorithm for the Dirichlet multinomial mixture model (GSDMM) with clustering guidance, with the aim of improving the effectiveness and accuracy of short text classification. First, the fusion model of BERT and GSDMM converts each short text into an integrated semantic vector. The integrated vectors reflect both global semantic features and topic features, and thereby address the sparsity of short text features and the lack of topic information. Then, a clustering guidance algorithm is introduced at the front end of classifier training, which expands the labeled data and improves the interpretability of the results. Finally, the expanded labeled data set is used to train the classifier, which completes the automatic classification of short texts. Using negative comments from an e-commerce platform as the validation data set, this study verifies the effectiveness and advantages of the algorithm for short text classification in multiple groups of comparative experiments.
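The three stages described above can be illustrated with a minimal sketch. It is not the implementation used in this study; it only shows the data flow under several assumptions. The semantic vectors and topic distributions are hand-written stand-ins for BERT and GSDMM outputs, fusion is assumed to be simple concatenation, plain k-means stands in for the clustering step, and label expansion is assumed to be a majority vote within each cluster. The function names (`fuse`, `kmeans`, `expand_labels`), the `min_purity` threshold, and the labels `logistics` and `quality` are hypothetical.

```python
import math
from collections import Counter


def fuse(semantic_vec, topic_dist):
    """Concatenate an L2-normalized semantic vector with a topic distribution.

    Assumption: fusion is plain concatenation; the study may fuse differently.
    """
    norm = math.sqrt(sum(x * x for x in semantic_vec)) or 1.0
    return [x / norm for x in semantic_vec] + list(topic_dist)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def expand_labels(assign, labels, min_purity=0.8):
    """Give unlabeled items the majority label of their cluster.

    A cluster is used only if it contains labeled items and the majority
    label covers at least `min_purity` of them (hypothetical threshold).
    """
    expanded = list(labels)
    for c in set(assign):
        known = [
            labels[i] for i, a in enumerate(assign)
            if a == c and labels[i] is not None
        ]
        if not known:
            continue
        label, count = Counter(known).most_common(1)[0]
        if count / len(known) >= min_purity:
            for i, a in enumerate(assign):
                if a == c and expanded[i] is None:
                    expanded[i] = label
    return expanded


# Toy stand-ins for BERT sentence vectors and GSDMM topic distributions.
semantic = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
topics = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels = ["logistics", "quality", None, None]  # None marks unlabeled texts

points = [fuse(s, t) for s, t in zip(semantic, topics)]
assign = kmeans(points, k=2)
expanded = expand_labels(assign, labels)
print(expanded)  # ['logistics', 'quality', 'logistics', 'quality']
```

The expanded label list would then serve as the training set for any downstream classifier. The purity threshold is one way to keep mixed clusters from propagating unreliable labels.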