Abstract:Feature selection, whose premise is feature extraction, is a key step to improve the accuracy and efficiency in retweeting prediction through achine learning methods. Currently, the approaches commonly adopted in feature selection include Information Gain (IG), mutual information, and CHI-square test (CHI). In the traditional feature selection methods, such problems of IG and CHI as negative correlation and interference calculation elicited by low-frequency words lead to low classification accuracy. In view of these problems, we introduce a balance factor and a word frequency factor in this study to increase the algorithm accuracy. Then, according to the spread characteristics of Weibo information, combined with the improved IG and CHI algorithms, we propose the feature selection method based on Balance Information Gain-Word Frequency CHI-square test (BIG-WFCHI). Furthermore, we experimentally test the proposed method with five classifiers including maximum entropy model, support vector machine, naive Bayes classifier, K-nearest neighbor, and multi-layer perceptron on two heterogeneous data sets. The results show that our method can effectively eliminate both irrelevant and redundant features, increase the classification accuracy, and reduce the running time.