计算机系统应用 (Computer Systems & Applications), 2019, Vol. 28, Issue (6): 198-202


Research and Improvement of C4.5 Algorithm in Decision Tree Classification Algorithm
HAN Cun-Ge1,2, YE Qiu-Sun1
1. School of Mathematics and Computer Science, Wuyi University, Wuyishan 354300, China;
2. Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing, Wuyishan 354300, China
Foundation item: Natural Science Foundation of Science and Technology Bureau, Fujian Province (2017J01406)
Abstract: The C4.5 algorithm is a classical algorithm for generating decision trees. Although it tolerates noise well, its classification accuracy drops noticeably when the rate of missing attribute values is high, and constructing the tree requires many scans of the data. This paper presents an improved classification algorithm that targets the repeated scanning of sorted data sets and the frequent logarithm calls. A method based on the naive Bayes theorem fills in missing attribute values and improves classification accuracy, and the selection formulas are optimized and reduced so that the four basic arithmetic operations replace the original logarithmic operations, shortening the time needed to build the decision tree. To verify the performance of the algorithm, five data sets from the UCI repository are tested. The experimental results show that the improved algorithm greatly improves running efficiency.
Key words: decision tree; C4.5 algorithm; naive Bayesian classifier; UCI

1 Theory of the C4.5 Algorithm

The C4.5 algorithm extends ID3 with support for continuous attributes and for missing attribute values, and builds a decision tree in two phases: tree generation and tree pruning. C4.5 computes the information gain ratio of every attribute, selects the attribute with the highest gain ratio as the test attribute of the given set S, creates a branch for each value of the test attribute, and recurses to obtain a preliminary decision tree. The relevant formulas of the C4.5 algorithm are as follows [9]:

(1) Expected information (information entropy). Let $S$ be a set of $s$ data samples. Suppose the class label attribute has $m$ distinct values defining $m$ distinct classes $T_i$ ($i = 1, \dots, m$), and let $s_i$ be the number of samples in class $T_i$. The expected information needed to classify a given sample is:

 $Info(S) = - \sum\limits_{i = 1}^m {\frac{{{s_i}}}{s}} \log_2 \frac{{{s_i}}}{s}$ (1)

(2) Information gain. The expected information of the subsets obtained by partitioning S on attribute A is:

 $E(A) = Inf{o_A}(S) = \sum\limits_{j = 1}^v \frac{{|{S_j}|}}{{|S|}} Info({S_j})$ (2)

The information gain of attribute A is then:

 $Gain(A) = Info(S) - E(A)$ (3)

(3) Information gain ratio. C4.5 introduces the information gain ratio; for attribute A it is computed as:

 $GainRatio(A) = \frac{{Gain(A)}}{{SplitE(A)}}$ (4)
 $SplitE(A) = - \sum\limits_{i = 1}^k \frac{{|{S_i}|}}{{|S|}}\log_2 \frac{{|{S_i}|}}{{|S|}}$ (5)

C4.5 prunes the generated tree with post-pruning to form the decision tree model, and from the finished model derives a set of IF-THEN rules that classify the training set.
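Formulas (1)-(5) translate directly into a short routine. The following is a minimal sketch, assuming samples are summarized as per-class counts; the function names are illustrative, and the example counts are the 14-sample split (9 positive, 5 negative) used later in Section 2.3:

```python
import math

def info(counts):
    """Expected information Info(S), Eq. (1), from per-class sample counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(class_counts, subsets):
    """Information gain ratio of one attribute, Eqs. (2)-(5).

    class_counts: per-class counts of the whole set S
    subsets: for each attribute value, the per-class counts of that subset
    """
    total = sum(class_counts)
    e_a = sum(sum(s) / total * info(s) for s in subsets)   # E(A), Eq. (2)
    gain = info(class_counts) - e_a                        # Gain(A), Eq. (3)
    split = info([sum(s) for s in subsets])                # SplitE(A), Eq. (5)
    return gain / split if split else 0.0

# 14 samples (9 positive, 5 negative); an attribute splits them into
# subsets with (positive, negative) counts (2,3), (4,0), (3,2).
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # → 0.156
```

The attribute with the largest gain ratio over all candidates would be chosen as the test attribute for the current node.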

2 Improved C4.5 Algorithm

2.1 Handling Missing Attribute Values During Decision Classification

(1) Extended naive Bayesian classification. A sample v is assigned to class $c_i$ when its posterior probability exceeds that of every other class $c_j$ ($j \ne i$):

 $\frac{{p({c_i})p(v|{c_i})}}{{p(v)}} > \frac{{p({c_j})p(v|{c_j})}}{{p(v)}}$

If the n attributes of v are mutually independent, then

 $p(v|{c_i}) = \prod\limits_{k = 1}^n {p({v_k}|{c_i})}$

Since $p(v)$ is the same for every class, the decision rule reduces to:

 $p({c_i})\prod\limits_{k = 1}^n {p({v_k}|{c_i})} > p({c_j})\prod\limits_{k = 1}^n {p({v_k}|{c_j})}$

Estimating the prior $p(c_i)$ by the class fraction $s_i/s$:

 $\frac{{{s_i}}}{s}\prod\limits_{k = 1}^n {p({v_k}|{c_i})} > \frac{{{s_j}}}{s}\prod\limits_{k = 1}^n {p({v_k}|{c_j})},\;\; 1 \le j \le m \;\text{and}\; j \ne i$

 $\dfrac{{{s'_i}}}{s}\prod\limits_{k = 1}^n {p({{v''}_k}|{{c'}_i})} > \dfrac{{{s'_j}}}{s}\prod\limits_{k = 1}^n {p({{v''}_k}|{{c'}_j})},\;\; 1 \le j \le m \;\text{and}\; j \ne i$
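The decision rule above — assign v to the class $c_i$ maximizing $(s_i/s)\prod_k p(v_k|c_i)$ — can be sketched as follows. This is an illustrative implementation, not the paper's code: the record layout (dicts with a "label" key) and the function name are assumptions, and the conditional probabilities are estimated as raw frequencies without smoothing:

```python
from collections import defaultdict

def nb_classify(train, attrs, record, label="label"):
    """Return the class c_i maximizing (s_i/s) * prod_k p(v_k | c_i)."""
    s = len(train)
    by_class = defaultdict(list)
    for r in train:
        by_class[r[label]].append(r)
    best_class, best_score = None, -1.0
    for c, rows in by_class.items():
        score = len(rows) / s                          # prior s_i / s
        for a in attrs:
            matches = sum(r[a] == record[a] for r in rows)
            score *= matches / len(rows)               # p(v_k | c_i)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny illustrative training set with one attribute "x".
train = [
    {"x": "a", "label": "pos"},
    {"x": "a", "label": "pos"},
    {"x": "b", "label": "neg"},
]
print(nb_classify(train, ["x"], {"x": "a"}))  # → pos
```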

(2) Procedure for handling missing attribute values

① Read in the test data.

② If a record contains no missing attribute values, push it into set D1; if it does contain missing values, insert it into set D2, ordered by the number of missing attributes per record, fewest first.

③ Return to step ①, until every record in the test data set has been examined. This yields the two sets D1 and D2: D1 holds the records without missing values, and D2 holds those with missing values.

④ Take a record out of D2, assign values to its missing attributes, and add it to D1.

⑤ Repeat step ④ until D2 is empty.
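The steps above can be sketched as follows. This is a simplified illustration under stated assumptions: records are dicts with None marking a missing value, and each gap is filled with the value most frequent among complete records of the same class — a stand-in for the full Bayesian estimate of part (1), which for a single attribute picks the same most-probable value:

```python
from collections import Counter

def fill_missing(records, attrs, label="label"):
    """Split records into D1 (complete) and D2 (incomplete), then impute.

    D2 is sorted so records with fewer missing values are processed first
    (step ②); each completed record joins D1 (step ④) and supports later
    imputations, until D2 is empty (step ⑤).
    """
    d1 = [r for r in records if all(r[a] is not None for a in attrs)]
    d2 = [r for r in records if any(r[a] is None for a in attrs)]
    d2.sort(key=lambda r: sum(r[a] is None for a in attrs))  # fewest gaps first
    while d2:
        rec = d2.pop(0)
        for a in attrs:
            if rec[a] is None:
                # Most frequent value among same-class complete records
                # (assumes D1 already contains records of that class).
                same_class = [r[a] for r in d1 if r[label] == rec[label]]
                rec[a] = Counter(same_class).most_common(1)[0][0]
        d1.append(rec)
    return d1

data = [
    {"outlook": "sunny", "windy": "no",  "label": "yes"},
    {"outlook": "sunny", "windy": "no",  "label": "yes"},
    {"outlook": "rain",  "windy": "yes", "label": "no"},
    {"outlook": None,    "windy": "no",  "label": "yes"},
]
filled = fill_missing(data, ["outlook", "windy"])
print(filled[-1]["outlook"])  # → sunny (most common among class "yes")
```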

2.2 Improvement Idea and Improved Calculation

Suppose the data set S contains two classes, with m positive and n negative samples. The expected information is then:

 $Info(S) = - \frac{m}{{m + n}}\log_2 \frac{m}{{m + n}} - \frac{n}{{m + n}}\log_2 \frac{n}{{m + n}}$

For the subset $S_i$ produced by splitting on attribute A, containing $m_i$ positive and $n_i$ negative samples:

 $I({S_i}) = - \frac{{{m_i}}}{{{m_i} + {n_i}}}\log_2 \frac{{{m_i}}}{{{m_i} + {n_i}}} - \frac{{{n_i}}}{{{m_i} + {n_i}}}\log_2 \frac{{{n_i}}}{{{m_i} + {n_i}}}$

 $\begin{split}Inf{o_A}(S) &= \sum\limits_{i = 1}^v {\frac{{{m_i} + {n_i}}}{{m + n}}} I({S_i})\\ &= \frac{1}{{(m + n)\ln 2}}\sum\limits_{i = 1}^v {\left( - {m_i}\ln \frac{{{m_i}}}{{{m_i} + {n_i}}} - {n_i}\ln \frac{{{n_i}}}{{{m_i} + {n_i}}}\right)} \end{split}$
 $Inf{o_A}(S) = \sum\limits_{i = 1}^v {\left( - {m_i}\ln \frac{{{m_i}}}{{{m_i} + {n_i}}} - {n_i}\ln \frac{{{n_i}}}{{{m_i} + {n_i}}}\right)}$

(Since $\dfrac{1}{{(m + n)\ln 2}}$ is a constant, it can be dropped.)

Using the first-order approximation $\ln(1 - x) \approx -x$:

 $\ln \frac{{{m_i}}}{{{m_i} + {n_i}}} = \ln \left(1 - \frac{{{n_i}}}{{{m_i} + {n_i}}}\right) \approx - \frac{{{n_i}}}{{{m_i} + {n_i}}}$

Similarly, $\ln \dfrac{{{n_i}}}{{{m_i} + {n_i}}} \approx - \dfrac{{{m_i}}}{{{m_i} + {n_i}}}$.

 $\begin{split}Inf{o_A}(S) &= \sum\limits_{i = 1}^v {\left( - {m_i}\ln \frac{{{m_i}}}{{{m_i} + {n_i}}} - {n_i}\ln \frac{{{n_i}}}{{{m_i} + {n_i}}}\right)}\\ &\approx \sum\limits_{i = 1}^v \left({\frac{{{m_i}{n_i}}}{{{m_i} + {n_i}}}} + \frac{{{m_i}{n_i}}}{{{m_i} + {n_i}}}\right) = \sum\limits_{i = 1}^v {\frac{{2{m_i}{n_i}}}{{{m_i} + {n_i}}}} \end{split}$
 $Inf{o_A}(S) \approx \sum\limits_{i = 1}^v {\frac{{{m_i}{n_i}}}{{{m_i} + {n_i}}}} \;(\text{the constant 2 is omitted})$

 $\begin{split} SplitE(A) &= \sum\limits_{i = 1}^v \left( - \frac{{{n_i}}}{{{m_i} + {n_i}}}\log_2 \frac{{{n_i}}}{{{m_i} + {n_i}}} - \frac{{{m_i}}}{{{m_i} + {n_i}}}\log_2 \frac{{{m_i}}}{{{m_i} + {n_i}}} \right) \\ &\approx \sum\limits_{i = 1}^v {\frac{{2{m_i}{n_i}}}{{\ln 2\,{{({m_i} + {n_i})}^2}}}} \end{split}$
 $SplitE(A) \approx \sum\limits_{i = 1}^v {\frac{{{m_i}{n_i}}}{{{{({m_i} + {n_i})}^2}}}} \;\left(\text{since } \frac{2}{{\ln 2}} \text{ is a constant, it can be dropped}\right)$
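The effect of the simplification can be seen side by side. Below is a minimal sketch (the function names are mine): the exact $Info_A(S)$ needs logarithms, while the improved measure uses only the four basic operations. Because the constants and the $\ln(1-x) \approx -x$ approximation are dropped, the two differ in scale; the improved measure is meant to preserve the relative ranking of attributes, not the numeric value:

```python
import math

def info_a_exact(subsets):
    """Exact Info_A(S): weighted two-class entropy over subsets of (m_i, n_i)."""
    total = sum(m + n for m, n in subsets)
    out = 0.0
    for m, n in subsets:
        for c in (m, n):
            if c > 0:
                out -= (c / total) * math.log2(c / (m + n))
    return out

def info_a_fast(subsets):
    """Improved measure: sum of m_i*n_i/(m_i+n_i); constants dropped, no logs."""
    return sum(m * n / (m + n) for m, n in subsets)

subsets = [(2, 3), (4, 0), (3, 2)]       # the "age" split of the 14 samples
print(round(info_a_exact(subsets), 3))   # → 0.694
print(info_a_fast(subsets))              # → 2.4
```

Note that 2.4 matches the Info(age) value computed in Step 2 of Section 2.3.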

2.3 Practical Application of the N-C4.5 Algorithm

The following example uses a 14-sample data set with 9 positive and 5 negative samples and the attributes age, income, student, and credit_rating.

Step1. Compute the expected information needed for classification

 $Info(S) = \frac{{mn}}{{{{(m + n)}^2}}} = \frac{{9 \times 5}}{{{{(9 + 5)}^2}}} = 0.230$

Step2. Compute the information entropy of each attribute

 $\begin{split} &Info(age) = \frac{{2 \times 3}}{{2 + 3}} + \frac{{4 \times 0}}{{4 + 0}} + \frac{{3 \times 2}}{{3 + 2}} = 2.4\\ &Info(income) = 3.083\\ &Info(student) = 2.571\\ &Info(credit\_rating) = 3.0 \end{split}$

Step3. Compute the information gain of each attribute

 $Gain(age) = Info(S) - Info(age) = 0.230 - 2.4 = -2.170$
 $Gain(income) = 0.230 - 3.083 = -2.783$
 $Gain(student) = 0.230 - 2.571 = -2.341$
 $Gain(credit\_rating) = 0.230 - 3.0 = -2.8$

Step4. Compute the split information measure of each attribute

 $\begin{split} &SplitE(age) = \frac{{5 \times 4 \times 5}}{{{{(5 + 4 + 5)}^2}}} = 0.510\\ &SplitE(income) = 0.490\\ &SplitE(student) = 0.250\\ &SplitE(credit\_rating) = 0.245 \end{split}$

Step5. Compute the information gain ratio of each attribute

 $\begin{split}&GainRatio(age) = \frac{{ - 2.170}}{{0.510}} = - 4.255\\ &GainRatio(income) = - 5.680\\ &GainRatio(student) = - 9.346\\ &GainRatio(credit\_rating) = - 11.430 \end{split}$
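The five steps can be reproduced in code for the age attribute. This is a sketch under the numbers given above: the (positive, negative) counts of age's three subsets are (2, 3), (4, 0), (3, 2), and Step 4 is computed as printed in this section, i.e. the product of the subset sizes over $|S|^2$ (which reproduces 0.510, 0.490, 0.250, 0.245):

```python
from math import prod

def n_c45_score(subsets):
    """N-C4.5 metrics for one attribute from two-class (m_i, n_i) counts."""
    m = sum(mi for mi, _ in subsets)              # total positive samples
    n = sum(ni for _, ni in subsets)              # total negative samples
    info_s = m * n / (m + n) ** 2                 # Step 1
    info_a = sum(mi * ni / (mi + ni) for mi, ni in subsets)  # Step 2
    gain = info_s - info_a                        # Step 3
    sizes = [mi + ni for mi, ni in subsets]
    split = prod(sizes) / sum(sizes) ** 2         # Step 4 (as printed above)
    return gain / split                           # Step 5

age = [(2, 3), (4, 0), (3, 2)]
print(round(n_c45_score(age), 2))  # → -4.25, GainRatio(age) up to rounding
```

The attribute with the largest gain ratio among the four candidates would become the root of the tree in Fig. 1.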

 Fig. 1 Decision tree constructed by the N-C4.5 algorithm

3 Comparison and Analysis of Experimental Results

The Iris data set is one of the most typical data sets for decision tree mining. It contains 150 records with 4 splitting attributes and 3 class labels. Part of the information was removed artificially: the missing attribute is sepal width, with 30 incomplete records in total. Five simulation runs of the algorithm before and after the improvement were performed on the Iris data set; the results are shown in Fig. 2.

 Fig. 2 Modeling time on the Iris data set

4 Conclusion

References
[1] Lu SY. Comparative Analysis of Classification Algorithms in Data Mining [Master's thesis]. Tianjin: Tianjin University of Finance and Economics, 2016.
[2] Tang Y. Research on the Application of an Improved ID3 Algorithm in a University Employment System [Master's thesis]. Changsha: Hunan University, 2012.
[3] Chen J, Wu CX. Improvement and application of the decision tree C4.5 algorithm. Software Guide, 2018(10): 88-92.
[4] Huang XX. Optimization and Application of the C4.5 Decision Tree Algorithm [Master's thesis]. Wuxi: Jiangnan University, 2017.
[5] Xu P, Lin S. Internet traffic classification using the C4.5 decision tree. Journal of Software, 2009, 20(10): 2692-2704.
[6] Cai X. The decision tree algorithm and its improvement. Science and Technology Innovation Herald, 2014(12): 40, 45.
[7] Yao YF, Xing LT. Improved C4.5 decision tree algorithm for split thresholds of continuous attributes and its application. Journal of Central South University (Science and Technology), 2011, 42(12): 3772-3776.
[8] Liu XW, Wang DH, Chen FX. An improved C4.5 decision tree algorithm based on variable precision rough sets. Application Research of Computers, 2011, 28(10): 3649-3651. DOI: 10.3969/j.issn.1001-3695.2011.10.011
[9] Han JW, Kamber M. Data Mining: Concepts and Techniques. Fan M, Meng XF, trans. Beijing: China Machine Press, 2007.
[10] Yue MM. Research and Application of Data Mining Classification Algorithms [Master's thesis]. Chengdu: University of Electronic Science and Technology of China, 2017.
[11] Li X. A Comparative Study of Five Decision Tree Algorithms [Master's thesis]. Dalian: Dalian University of Technology, 2011.
[12] Miao LF. Research and Application of an Improved C4.5 Algorithm in the Analysis of College Students' Emotional Quality [Master's thesis]. Shanghai: Shanghai Normal University, 2018.
[13] Li JS. Research on Bayesian Classifiers Based on Volume Test [Master's thesis]. Beijing: Beijing Jiaotong University, 2008.