Key Feature Selection Method for Weibo Information Based on BIG-WFCHI

Fund project: National Natural Science Foundation of China (61672027)

    Abstract:

    Feature selection, which presupposes feature extraction, is a key step in improving the accuracy and efficiency of retweet prediction with machine learning methods. Commonly used feature selection methods include Information Gain (IG), mutual information, and the chi-square test (CHI). In these traditional methods, low-frequency words cause negative correlation and interfere with the computation of IG and CHI, which lowers classification accuracy. To address these problems, this paper first introduces a balance factor into IG and a word frequency factor into CHI to improve their accuracy. Second, based on the characteristics of information propagation on Weibo, it combines the improved IG and CHI algorithms into a feature selection method called BIG-WFCHI (Balance Information Gain-Word Frequency CHI-square test). The method is evaluated on two heterogeneous data sets with five classifiers: a maximum entropy model, a support vector machine, a naive Bayes classifier, K-nearest neighbors (KNN), and a multi-layer perceptron. The experimental results show that the proposed method effectively eliminates irrelevant and redundant features, improves classification accuracy, and reduces running time.
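
For readers unfamiliar with the two baseline scores named in the abstract, the following is a minimal sketch of the textbook IG and CHI computations for one term and one class. It assumes binary term presence and a 2x2 document-count table; the function names and the count layout (a, b, c, d) are illustrative choices, not taken from the paper. The balance factor and word frequency factor of BIG-WFCHI are not reproduced, because the abstract does not give their formulas.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def chi_square(a, b, c, d):
    """Standard CHI score of a term for a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence
    return n * (a * d - b * c) ** 2 / denom


def information_gain(a, b, c, d):
    """Standard IG of a term: class entropy minus entropy conditioned on the term."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = entropy([a + c, b + d])   # H(C)
    h_present = entropy([a, b])         # H(C | term present)
    h_absent = entropy([c, d])          # H(C | term absent)
    return h_class - (a + b) / n * h_present - (c + d) / n * h_absent


if __name__ == "__main__":
    # Hypothetical counts: a term that appears only inside the class,
    # and a term that appears only outside it.
    print(chi_square(10, 0, 0, 10))  # 20.0
    print(chi_square(0, 10, 10, 0))  # 20.0
```

Both calls print the same score because the standard CHI formula squares (a*d - b*c) and so discards the sign of the association. A term that is negatively associated with a class therefore ranks as high as a positively associated one, which is one common reading of the negative-correlation problem the abstract mentions.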

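Cite this article

殷仕刚, 安洋, 蔡欣华, 屈小娥. Key Feature Selection Method for Weibo Information Based on BIG-WFCHI. Computer Systems & Applications (计算机系统应用), 2021, 30(2): 188-193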

History
  • Received: 2020-06-15
  • Revised: 2020-07-14
  • Published online: 2021-01-29