Abstract: The classical TF-IDF algorithm considers only term frequency and inverse document frequency, overlooking how feature terms are distributed between and within categories. In this study, we compute feature-term weights with the TF-IDF algorithm on corpora of different scales and analyze how category information affects the weights. Based on this analysis, a new method is proposed to measure the distribution information of feature terms between and within categories. Furthermore, an improved algorithm, TF-IDF-DI, is proposed that incorporates category information by adding two new weights, based on between-category and within-category discrete factors, to the classic TF-IDF weight. The Naive Bayes algorithm is used to validate the classification performance of the improved algorithm. Experiments show that the improved algorithm outperforms the classic TF-IDF algorithm in precision, recall, and F1 score.
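The abstract does not give the exact formulas for TF-IDF-DI, so the sketch below is only an illustration of the general idea under stated assumptions: it scales a classic TF-IDF weight by two placeholder factors, `between_category_factor` and `within_category_factor`, which are hypothetical definitions chosen here for illustration and are not the authors' formulation.

```python
# Sketch of a category-aware TF-IDF weight.
# NOTE: the two dispersion factors below are illustrative assumptions,
# not the paper's exact TF-IDF-DI definitions.

import math


def tf_idf(term, doc, docs):
    """Classic TF-IDF: term frequency in `doc` times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / (1 + df))
    return tf * idf


def between_category_factor(term, docs_by_cat):
    """Assumed factor: high when the term's documents concentrate in few categories.
    Computed as the ratio of the largest per-category document frequency to the total."""
    per_cat_df = [sum(1 for d in ds if term in d) for ds in docs_by_cat.values()]
    total = sum(per_cat_df)
    return max(per_cat_df) / total if total else 0.0


def within_category_factor(term, docs_by_cat):
    """Assumed factor: high when the term is spread evenly over the documents
    of the category in which it occurs most often."""
    best_cat = max(docs_by_cat,
                   key=lambda c: sum(1 for d in docs_by_cat[c] if term in d))
    ds = docs_by_cat[best_cat]
    return sum(1 for d in ds if term in d) / len(ds)


def tf_idf_di(term, doc, docs_by_cat):
    """Category-aware weight: classic TF-IDF scaled by the two factors above."""
    all_docs = [d for ds in docs_by_cat.values() for d in ds]
    return (tf_idf(term, doc, all_docs)
            * between_category_factor(term, docs_by_cat)
            * within_category_factor(term, docs_by_cat))


if __name__ == "__main__":
    # Toy corpus: documents are token lists grouped by category label.
    corpus = {
        "sports": [["match", "goal", "team"], ["team", "coach", "goal"]],
        "finance": [["stock", "market", "team"], ["market", "bond", "price"]],
    }
    doc = corpus["sports"][0]
    for w in ("goal", "team"):
        print(w, round(tf_idf_di(w, doc, corpus), 4))
```

In this toy run, "goal" (confined to the sports category) receives a higher weight than "team" (spread across categories), which is the kind of discrimination that category distribution information is intended to capture.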