Received: June 15, 2020; Revised: July 14, 2020
Abstract: Feature selection, which presupposes feature extraction, is a key step in improving the accuracy and efficiency of retweet prediction with machine learning methods. Commonly used feature selection methods include Information Gain (IG), mutual information, and the CHI-square test (CHI). In these traditional methods, low-frequency words cause problems for IG and CHI, such as negative correlation and interference with the calculation, which lower classification accuracy. To address these problems, this study introduces a balance factor and a word frequency factor to improve the accuracy of the two algorithms. Then, based on the propagation characteristics of Weibo information, the improved IG and CHI algorithms are combined into a feature selection method based on BIG-WFCHI (Balance Information Gain-Word Frequency CHI-square test). In the experiments, the method is tested on two heterogeneous data sets with five classifiers: maximum entropy model, support vector machine, naive Bayes classifier, K-nearest neighbor (KNN), and multi-layer perceptron. The results show that the proposed method effectively eliminates irrelevant and redundant features, improves classification accuracy, and reduces running time.
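For orientation, the two baseline scores named in the abstract can be sketched for a single term and a single class. This is a minimal sketch of the standard textbook IG and CHI formulas only. It does not include the balance factor or the word frequency factor, which the abstract does not define. The function names and the contingency counts `a`, `b`, `c`, `d` are illustrative assumptions and are not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for one class (not the paper's WFCHI).

    Hypothetical document counts:
      a: in the class, contains the term     b: not in the class, contains the term
      c: in the class, lacks the term        d: not in the class, lacks the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - b * c) ** 2 / denom


def _entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Standard information gain of a term for a binary class (not the paper's BIG)."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = _entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * _entropy([a, b]) + ((c + d) / n) * _entropy([c, d])
    return h_class - h_given_term
```

Because the chi-square score squares `a * d - b * c`, a term that appears only outside the class scores the same as a term that appears only inside it: `chi_square(0, 10, 10, 0)` equals `chi_square(10, 0, 0, 10)`. This loss of sign may be related to the negative-correlation problem the abstract mentions, though the abstract itself gives no details.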
Keywords: Weibo information; feature selection; machine learning; Information Gain (IG); CHI-square test (CHI)
Funding: National Natural Science Foundation of China (61672027)
Citation:
殷仕刚,安洋,蔡欣华,屈小娥.基于BIG-WFCHI的微博信息关键特征选择方法.计算机系统应用,2021,30(2):188-193
YIN Shi-Gang, AN Yang, CAI Xin-Hua, QU Xiao-E. Key Feature Selection Method for Weibo Information Based on BIG-WFCHI. Computer Systems Applications, 2021, 30(2): 188-193