Computer Systems & Applications (计算机系统应用), 2019, Vol. 28, Issue 7: 133-138

Evaluation Model for Personal Credit Risk Based on C4.5 Algorithm for Optimizing SVM
LIU Xiao-Ya, WANG Ying-Ming
School of Economics and Management, Fuzhou University, Fuzhou 350108, China
Foundation item: National Natural Science Foundation of China (61773123)
Abstract: The Support Vector Machine (SVM) is widely used in credit evaluation as a non-parametric method. However, it cannot actively select attributes when processing high-dimensional data, which may reduce accuracy. To overcome this shortcoming, a credit evaluation model is constructed in which a C4.5 decision tree optimizes the SVM by selecting attributes and removing redundant ones. The model determines the optimal parameters through grid search and uses the F-score and average accuracy to evaluate performance on two public data sets. Empirical analysis shows that the proposed model effectively shortens the data learning process and achieves higher classification accuracy and practicability than traditional single models.
Key words: personal credit evaluation     support vector machine     C4.5 decision tree     attribute selection     information entropy gain rate

1 Theoretical Background

1.1 Support Vector Machine

 $\left\{ \begin{array}{l} \min \dfrac{1}{2}{\left\| \omega \right\|^2}\\ {\rm{s.t.}}\;{y_i}[{\omega ^{\rm{T}}}{x_i} + b] \ge 1\;\;\;\;(i = 1,2, \cdots ,n) \end{array}\right.$ (1)

 $f(x) = {\rm{sgn}} \left\{ {\sum\limits_{i = 1}^n {{\alpha _i}^ * {y_i}({x_i}^{\rm{T}}x) + {b^ * }} } \right\}$ (2)

 $\left\{ \begin{array}{l} \min \dfrac{1}{2}{\left\| \omega \right\|^2} + C\left( {\sum\limits_{i = 1}^n {{\xi _i}} } \right)\\ {\rm{s.t.}}\;{y_i}[{\omega ^{\rm{T}}}{x_i} + b] \ge 1 - {\xi _i},\;\;{\xi _i} \ge 0\;\;\;\;(i = 1,2, \cdots ,n) \end{array}\right.$ (3)

 $f(x) = {\rm{sgn}} \left\{ {\sum\limits_{i = 1}^n {{\alpha _i}^ * {y_i}K\left( {{x_i},x} \right)} + {b^ * }} \right\}$ (4)

 $K({x_i},{x_j}) = \exp \left( {\gamma {{\left\| {{x_i} - {x_j}} \right\|}^2}} \right),\quad \gamma = - \dfrac{1}{{2{\sigma ^2}}}$ (5)
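
As an illustration of Eq. (5), the kernel can be evaluated directly for two feature vectors. The sketch below is a minimal example, not the authors' implementation; the function name `rbf_kernel` and the default `sigma=1.0` are assumptions made for illustration. It follows the sign convention of Eq. (5), in which gamma is negative.

```python
import math

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Evaluate K(x_i, x_j) = exp(gamma * ||x_i - x_j||^2) with gamma = -1 / (2 * sigma^2)."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    # gamma is negative under the convention of Eq. (5)
    gamma = -1.0 / (2.0 * sigma ** 2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the distance between the vectors grows; a smaller `sigma` makes the decay faster.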
1.2 C4.5 Decision Tree

 $GainRatio(D,A) = \frac{{Gain(D,A)}}{{Split\_info(D,A)}}$ (6)
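
To make Eq. (6) concrete, the following sketch computes the gain ratio of one categorical attribute. It assumes the standard C4.5 definitions, which this page does not spell out: Gain(D, A) is the entropy of the class labels minus the weighted entropy after splitting on A, and the split information is the entropy of the attribute's own value distribution. The function names are illustrative only.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(attribute_values, labels):
    """Gain ratio of a categorical attribute A on data set D, as in Eq. (6)."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after splitting on the attribute.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(attribute_values)
    if split_info == 0.0:
        # The attribute takes a single value, so it cannot split the data.
        return 0.0
    return gain / split_info
```

An attribute that separates the classes perfectly with two equally sized branches has a gain ratio of 1, while an attribute that is independent of the class has a gain ratio of 0.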

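2 Personal Credit Evaluation Model Based on C4.5-Optimized SVM

The C4.5-optimized SVM personal credit evaluation model comprises two subsystems: one performs attribute selection with the C4.5 decision tree and optimizes the SVM parameters; the other trains the SVM classifier and tests its performance.

2.1 SVM Parameter Optimization

2.2 Personal Credit Evaluation Model Based on C4.5-Optimized SVM

The flowchart of the C4.5-optimized SVM personal credit evaluation model is shown in Figure 1. The specific steps are as follows: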

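(1) Set the misclassification cost ratio. In practice, misjudging a customer with "good" credit as "bad" may cost only the loan interest, whereas misjudging a customer with "bad" credit as "good" may expose the lender to a large default risk. Because the two losses are unequal, the decision tree model introduces the potential loss into the analysis by setting a cost ratio.

(2) Set the number of Boosting iterations. Repeated Boosting iterations steadily increase the probability that misclassified samples are drawn into the training set, which improves model accuracy.

(3) Determine the pruning severity of the decision tree. Compare different pruning levels to find the best degree of pruning.

(4) Feature selection. Compute the feature contribution rates under the optimal tree and select the attributes that strongly affect the classification result.

(1) Form a new data set from the result of the feature selection in step 3. Use k-fold cross-validation: divide the whole data set into k disjoint subsets, so that with m samples each subset holds m/k samples. In each round, take one subset as the test set and the other k-1 subsets as the training set.

(3) Train the classifier. Use grid search to optimize the SVM parameter C and the kernel function parameter.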
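
The k-fold split and the grid search described in these steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: `k_fold_splits` and `grid_search` are hypothetical helper names, the candidate values for C and the kernel parameter are left to the caller, and `score_fn` stands in for whatever cross-validated score (for example, mean accuracy of an SVM over the folds) is being maximized.

```python
def k_fold_splits(m, k):
    """Yield (train_indices, test_indices) for k disjoint folds over m samples."""
    if not 2 <= k <= m:
        raise ValueError("k must satisfy 2 <= k <= m")
    indices = list(range(m))
    # When m is not divisible by k, the first (m % k) folds get one extra sample.
    base, extra = divmod(m, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def grid_search(score_fn, c_values, kernel_param_values):
    """Return the (C, kernel parameter) pair with the highest score_fn value."""
    candidates = [(c, p) for c in c_values for p in kernel_param_values]
    return max(candidates, key=lambda pair: score_fn(*pair))
```

Each sample index appears in exactly one test fold, so every sample is used for testing once and for training k-1 times. In practice the data are usually shuffled before splitting; that step is omitted here to keep the sketch deterministic.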

2.3 Model Evaluation Metrics

 $F\text{-}score = \frac{{2 \times recall \times precision}}{{recall + precision}}$ (7)
 $precision = \frac{{TP}}{{TP + FP}}$ (8)
 $recall = \frac{{TP}}{{TP + FN}}$ (9)
 $accuracy = \frac{{TP + TN}}{{TP + FP + FN + TN}}$ (10)
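
Equations (7) to (10) can be computed directly from the four confusion-matrix counts. The sketch below is illustrative; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions, not part of the source.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-score, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if recall + precision:
        f_score = 2 * recall * precision / (recall + precision)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / total
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }
```

For example, with TP = 40, FP = 10, FN = 20, and TN = 30, precision is 0.8, recall is 2/3, the F-score is 8/11, and accuracy is 0.7.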
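3 Empirical Analysis

3.1 Data Sets

 Figure 1. C4.5 decision tree optimized SVM model

3.2 Feature Extraction Based on the C4.5 Decision Tree Algorithm

 Figure 2. Feature contribution for the German credit data

 Figure 3. Feature contribution for the Australian credit data

3.3 Personal Credit Evaluation Based on C4.5-Optimized SVM

4 Conclusion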
