Computer Systems & Applications, 2019, Vol. 28, Issue (6): 135-140

Naive Bayes Classification Algorithm of Feature Weighting Based on Two-Dimensional Information Gain
REN Shi-Chao, HUANG Zi-Liang
School of Communication Engineering, Chengdu University of Information Engineering, Chengdu 610225, China
Abstract: The naive Bayes algorithm rests on the feature-independence assumption, and the traditional TF-IDF weighting scheme considers only the distribution of a feature over the whole training set, ignoring the relationship between the feature and the categories or documents; the weights assigned by the traditional method therefore do not reflect a feature's true discriminative power. To address these problems, this study proposes a naive Bayes classification algorithm with feature weighting based on two-dimensional information gain, which jointly considers a feature's information gain with respect to the categories and its information gain with respect to the documents. Compared with the traditional feature-weighted naive Bayes algorithm, the proposed algorithm improves precision, recall, and F1 score by about 6%.
Key words: naive Bayes     text classification     feature weighting     two-dimensional information gain     weighting algorithm

1 Naive Bayes and Its Improved Weighting Algorithms

1.1 The Naive Bayes Algorithm

By Bayes' theorem, the posterior probability of category ${C_j}$ given document ${D_i}$ is

 $P({C_j}|{D_i}) = \frac{{P({D_i}|{C_j})P({C_j})}}{{P({D_i})}}$ (1)

Since $P({D_i})$ is the same for every category, it can be dropped when selecting the most probable category:

 $\mathop {\max }\limits_{{C_j} \in C} P({C_j}|{D_i}) = \mathop {\max }\limits_{{C_j} \in C} P({D_i}|{C_j})P({C_j})$ (2)

Under the conditional independence assumption, with $m$ features ${t_k}$ in document ${D_i}$, the maximum a posteriori category is

 ${C_{map}} = \mathop {\max }\limits_{{C_j} \in C} P({C_j}|{D_i}) = \mathop {\max }\limits_{{C_j} \in C} P({C_j})\prod\limits_{k = 1}^m {P({t_k}|{C_j})}$ (3)
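As a minimal illustration of the decision rule in Eq. (3), the following sketch (function and variable names are mine, not the paper's) trains a multinomial naive Bayes model with Laplace smoothing and classifies in log space:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate the prior P(C_j) and likelihood P(t|C_j) with Laplace smoothing."""
    vocab = {t for d in docs for t in d}
    prior = {c: labels.count(c) / len(labels) for c in set(labels)}
    cond = {}
    for c in prior:
        counts = Counter(t for d, y in zip(docs, labels) if y == c for t in d)
        total = sum(counts.values())
        cond[c] = {t: (counts[t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return prior, cond

def classify(doc, prior, cond):
    """Eq. (3) in log space: argmax_j log P(C_j) + sum_k log P(t_k|C_j)."""
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in doc if t in cond[c])
    return max(prior, key=score)
```

Working in log space avoids the floating-point underflow that the product in Eq. (3) would otherwise cause on long documents.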

1.2 TF-IDF Weighted Naive Bayes Algorithm

Introducing a weight ${W_k}$ for each feature ${t_k}$ yields the weighted naive Bayes decision rule:

 ${C_{map}} = \mathop {\max }\limits_{{C_j} \in C} P({C_j}|{D_i}) = \mathop {\max }\limits_{{C_j} \in C} P({C_j})\prod\limits_{k = 1}^m {P{{({t_k}|{C_j})}^{{W_k}}}}$ (4)

Taking logarithms gives the equivalent form

 $\begin{array}{ll} {C_{map}} &= \mathop {\max }\limits_{{C_j} \in C} P({C_j}|{D_i})\\ &= \mathop {\max }\limits_{{C_j} \in C} [\ln P({C_j}) + \displaystyle\sum\limits_{k = 1}^m {{W_k}\ln P({t_k}|{C_j})} ] \end{array}$ (5)
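Equation (5) is then just a weighted sum of log-probabilities. A minimal sketch (the dictionaries `prior`, `cond`, and `weight` are assumed precomputed; names are illustrative):

```python
import math

def weighted_score(doc, c, prior, cond, weight):
    """Eq. (5): ln P(C_j) + sum_k W_k * ln P(t_k|C_j); unseen weights default to 1."""
    return math.log(prior[c]) + sum(
        weight.get(t, 1.0) * math.log(cond[c][t]) for t in doc if t in cond[c])

def weighted_classify(doc, prior, cond, weight):
    """Pick the category with the highest weighted log-score."""
    return max(prior, key=lambda c: weighted_score(doc, c, prior, cond, weight))
```

A larger $W_k$ amplifies feature ${t_k}$'s contribution to the decision, which is exactly the lever the weighting schemes below adjust.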

The idea behind TF-IDF: even if a feature word appears with low frequency in the text collection as a whole, the more often it appears in a particular document, the more it contributes to classifying that document; conversely, the more documents a feature word appears in, the less it contributes to classification [6,11]. The TF-IDF algorithm combines term frequency and inverse document frequency into a feature weight, computed in normalized form as follows:

 $IDF({t_i}) = lb\left(\frac{N}{{n({t_i})}} + 0.01\right)$ (6)

 ${W_i} = \frac{{TF({t_i})*IDF({t_i})}}{{\sqrt {\displaystyle\sum\nolimits_{k = 1}^m {{{(TF({t_k})*IDF({t_k}))}^2}} } }}$ (7)

where $N$ is the total number of documents and $n({t_i})$ is the number of documents containing ${t_i}$.
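A small sketch of the normalized TF-IDF weight of Eqs. (6)-(7); variable names are mine, and `lb` is log base 2 as in the paper:

```python
import math

def tfidf_weights(tf, df, N):
    """Eqs. (6)-(7): W_i = TF(t_i)*IDF(t_i), L2-normalized over the document's terms.
    tf: term -> frequency in this document; df: term -> no. of docs containing it;
    N: total number of documents."""
    raw = {t: tf[t] * math.log2(N / df[t] + 0.01) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}
```

A term frequent in the document but rare in the collection gets a large weight; a term present in most documents gets a weight near zero.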

1.3 Improved Weighting Algorithm Based on TF-IDF*IGC

 ${W_i} = TF({t_i})*IDF({t_i})*IG({t_i},{C_j})$ (8)

 $\begin{aligned} I{G_c}(C,t) &= E(C) - E(C|t)\\ & = \displaystyle\sum\limits_{j = 1}^V {P({C_j}|t)*lb(P({C_j}|t))} - \displaystyle\sum\limits_{j = 1}^V {P({C_j})*lb(P({C_j}))} \end{aligned}$ (9)
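The class information gain of Eq. (9) reduces to the difference of two entropy-style sums. A minimal sketch, assuming the two distributions are already estimated:

```python
import math

def class_information_gain(p_c_given_t, p_c):
    """Eq. (9): sum_j P(C_j|t)*lb P(C_j|t) - sum_j P(C_j)*lb P(C_j).
    Both arguments are probability distributions over the V categories."""
    h = lambda ps: sum(p * math.log2(p) for p in ps if p > 0)
    return h(p_c_given_t) - h(p_c)
```

A term concentrated entirely in one category recovers the full class entropy, while a term spread like the prior contributes zero gain.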

2 IGC-IGD Feature-Weighted Naive Bayes Algorithm

Let $L$ be a Laplace smoothing constant and $V$ the number of categories; the smoothed probabilities of feature $t$ and document ${D_t}$ with respect to category ${C_j}$ are estimated as

 $P(t,{C_j}) = \frac{{tf(t,{C_j}) + L}}{{\displaystyle\sum\nolimits_{j = 1}^V {tf(t,{C_j}) + V*L} }}$ (10)

 $P({D_t},{C_j}) = \frac{{tf({D_t},{C_j}) + L}}{{\displaystyle\sum\nolimits_{j = 1}^V {tf({D_t},{C_j}) + V*L} }}$ (11)

 $\begin{aligned} IGC &= E({C_j}) - E({C_j}|t) \\ &= \displaystyle\sum\limits_{j = 1}^V {P(t,{C_j})*lb(P(t,{C_j}))} - \displaystyle\sum\limits_{j = 1}^V {P({C_j})*lb(P({C_j}))} \end{aligned}$ (12)

 $\begin{aligned} IGD &= E({C_j}) - E({C_j}|{D_t}) \\ &= \displaystyle\sum\limits_{j = 1}^V {P({D_t},{C_j})*lb(P({D_t},{C_j}))} - \displaystyle\sum\limits_{j = 1}^V {P({C_j})*lb(P({C_j}))} \end{aligned}$ (13)

The two gains are combined and normalized to give the final feature weight:

 ${W_{IGC - IGD}} = \frac{{IGC({t_i})*IGD({t_i})}}{{\displaystyle\sum\nolimits_{k = 1}^m {{{(IGC({t_k})*IGD({t_k}))}^2}} }}$ (14)
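Combining the two gains per Eq. (14) might look like the following sketch (the input dictionaries map each term to its precomputed IGC and IGD values; the normalization follows the printed formula):

```python
def igc_igd_weights(igc, igd):
    """Eq. (14): W_i = IGC(t_i)*IGD(t_i) / sum_k (IGC(t_k)*IGD(t_k))^2."""
    prod = {t: igc[t] * igd[t] for t in igc}
    denom = sum(v * v for v in prod.values())
    return {t: v / denom for t, v in prod.items()}
```

The resulting weights are then plugged into the weighted decision rule of Eq. (5) in place of the TF-IDF weights.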

3 Experimental Design and Result Analysis

3.1 Experimental Data and Preprocessing

3.2 Analysis of Experimental Results

F1 score: $F1 = \dfrac{{2*P*R}}{{P + R}}$

Macro-F1 score: $Macro\_F1 = \dfrac{{2*Macro\_P*Macro\_R}}{{Macro\_P + Macro\_R}}$

TP denotes samples correctly labeled as positive, FP those incorrectly labeled as positive, FN those incorrectly labeled as negative, and TN those correctly labeled as negative [4], as shown in Table 3.
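The evaluation metrics follow directly from these counts; a short sketch:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from the confusion counts of Table 3."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def macro_f1(per_class):
    """Macro-F1: harmonic mean of macro-averaged precision and recall,
    where per_class is a list of (tp, fp, fn) tuples, one per category."""
    ps, rs = zip(*[(tp / (tp + fp), tp / (tp + fn)) for tp, fp, fn in per_class])
    mp, mr = sum(ps) / len(ps), sum(rs) / len(rs)
    return 2 * mp * mr / (mp + mr)
```

Note that macro-F1 here averages precision and recall over the categories first and then takes the harmonic mean, matching the paper's formula above.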

Figure 1. Comparison of the macro-F1 scores of the three algorithms

The F1 score is the harmonic mean of precision P and recall R. The macro-F1 score is the average of the F1 scores over all categories, and thus reflects the overall performance (precision, recall, F1) of each weighting algorithm. As Figure 1 shows, when the number of features grows from 500 to 1000, the macro-F1 of the IGC-IGD weighted naive Bayes classifier stays above that of the other two algorithms; computed with the macro-F1 formula in Section 3.2, it is roughly 6% higher than that of the traditional weighting algorithm. Moreover, the algorithm does not fluctuate much as the number of features changes, indicating that, given a sufficient number of valuable features, the naive Bayes classifier weighted by the two-dimensional information gain IGC-IGD classifies text effectively.

4 Conclusion

[1] 邸鹏, 段利国. 一种新型朴素贝叶斯文本分类算法. 数据采集与处理, 2014, 29(1): 71-75. DOI:10.3969/j.issn.1004-9037.2014.01.010
[2] Han JW, Kamber M. 数据挖掘: 概念与技术. 范明, 孟小峰, 译. 北京: 机械工业出版社, 2007.
[3] 李忠波, 杨建华, 刘文琦. 基于数据填补和连续属性的朴素贝叶斯算法. 计算机工程与应用, 2016, 52(1): 133-140. DOI:10.3778/j.issn.1002-8331.1401-0232
[4] 周志华. 机器学习. 北京: 清华大学出版社, 2016.
[5] 张玉芳, 陈小莉, 熊忠阳. 基于信息增益的特征词权重调整算法研究. 计算机工程与应用, 2007, 43(35): 159-161. DOI:10.3321/j.issn:1002-8331.2007.35.048
[6] 李学明, 李海瑞, 薛亮, 等. 基于信息增益与信息熵的TFIDF算法. 计算机工程, 2012, 38(8): 37-40. DOI:10.3778/j.issn.1002-8331.2012.08.011
[7] 饶丽丽, 刘雄辉, 张东站. 基于特征相关的改进加权朴素贝叶斯分类算法. 厦门大学学报(自然科学版), 2012, 51(4): 682-685.
[8] 武建军, 李昌兵. 基于互信息的加权朴素贝叶斯文本分类算法. 计算机系统应用, 2017, 26(7): 178-182.
[9] 贺鸣, 孙建军, 成颖. 基于朴素贝叶斯的文本分类研究综述. 情报科学, 2016, 34(7): 147-154.
[10] Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 1988, 24(5): 513-523.
[11] 李凯齐, 刁兴春, 曹建军. 基于信息增益的文本特征权重改进算法. 计算机工程, 2011, 37(1): 16-18, 21. DOI:10.3969/j.issn.1000-3428.2011.01.006
[12] Jiang LX, Li CQ, Wang SS, et al. Deep feature weighting for naive Bayes and its application to text classification. Engineering Applications of Artificial Intelligence, 2016, 52: 26-39. DOI:10.1016/j.engappai.2016.02.002
[13] Zhang LG, Jiang LX, Li CQ, et al. Two feature weighting approaches for naive Bayes text classifiers. Knowledge-Based Systems, 2016, 100: 137-144. DOI:10.1016/j.knosys.2016.02.017
[14] Song Y, Kołcz A, Lee Giles C. Better Naive Bayes classification for high-precision spam detection. Software—Practice & Experience, 2009, 39(11): 1003-1024.
[15] He W, Zhang Y, Yu SJ, et al. Deep feature weighting with a novel information gain for naive Bayes text classification. JIHMSP, 2019, 10(1).
[16] Wu J, Cai ZH, Zhu XQ. Self-adaptive probability estimation for naive Bayes classification. Proceedings of 2013 International Joint Conference on Neural Networks. Dallas, TX, USA. 2013. 1-8.
[17] Li L, Li C. Research and improvement of a spam filter based on naive Bayes. Proceedings of the 7th International Conference on Intelligent Human-Machine Systems and Cybernetics. Hangzhou, China. 2015. 361-364.
[18] Jiang QW, Wang W, Han X, et al. Deep feature weighting in Naive Bayes for Chinese text classification. Proceedings of the 2016 4th International Conference on Cloud Computing and Intelligence Systems. Beijing, China. 2016. 160-164.
[19] Arar ÖF, Ayan K. A feature dependent naive Bayes approach and its application to the software defect prediction problem. Applied Soft Computing, 2017, 59: 197-209. DOI:10.1016/j.asoc.2017.05.043
[20] Chen JN, Huang HK, Tian SF, et al. Feature selection for text classification with naïve Bayes. Expert Systems with Applications, 2009, 36(3): 5432-5435. DOI:10.1016/j.eswa.2008.06.054