 计算机系统应用 (Computer Systems & Applications), 2019, Vol. 28, Issue 7: 174-179


Hybrid Feature Selection Algorithm Fusing Sequential Backward Selection and Support Vector Machine
WU Qing-Shou, LIU Chang-Yong, LIN Li-Hui
1. School of Mathematics and Computer Science, Wuyi University, Wuyishan 354300, China;
2. Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing, Wuyishan 354300, China
Foundation item: Natural Science Foundation of Fujian Province (2019J01835, 2017J01651, 2017J01780); Middle-aged and Young Teachers Program for Education Research of Fujian Province (JAT170608); Open Fund of Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing (KLCCIIP2017104)
Abstract: The curse of dimensionality is a common problem in machine learning tasks. Feature selection algorithms select an optimal feature subset from the original data set, thereby reducing the feature dimension. This study proposes a hybrid feature selection algorithm. First, a chi-square test is used as a filter to select a subset of important features, which is then standardized; next, a wrapper algorithm, SBS-SVM, which combines Sequential Backward Selection (SBS) with a Support Vector Machine (SVM), selects the optimal feature subset, maximizing classification performance while effectively reducing the number of features. In the experiments, the SBS-SVM wrapper stage and two other algorithms are tested on three classical data sets. The results show that SBS-SVM achieves better classification performance and generalization ability.
Key words: hybrid feature selection; sequential backward selection; Support Vector Machine (SVM); dimension reduction

1 Hybrid Feature Selection Algorithm Model

 $x_{std}^{(i)} = \frac{{{x^{(i)}} - {\mu _x}}}{{{\sigma _x}}}$ (1)
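Equation (1) is the standard z-score transformation: each value is centered by the feature mean and scaled by the feature standard deviation. A minimal sketch in plain Python (function and variable names are illustrative, not from the paper):

```python
def standardize(xs):
    """Z-score standardization of one feature column, as in Eq. (1):
    x_std = (x - mu) / sigma, using the population standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sigma for x in xs]
```

After standardization the column has mean 0 and unit variance, which prevents features with large raw scales from dominating the SVM margin.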

 Fig. 1 Flow of the hybrid feature selection algorithm

2 Algorithm Implementation

2.1 Sequential Backward Selection Algorithm

The basic idea of the SBS algorithm is to remove irrelevant features step by step, under the constraint of minimal loss in classification performance, and thus select the feature subset most relevant to the problem. This improves the computational efficiency of the classification learning algorithm; pruning irrelevant features also reduces the model's generalization error and improves its predictive ability.

2.2 Support Vector Machine

 $\mathop {\min }\limits_{w,b} \frac{1}{2}{\left\| w \right\|^2},\;\;{\rm{s.t.}}\;\;{y_j}({w^{\rm{T}}}{x_j} + b) - 1 \geqslant 0$ (2)

 $\left\{ \begin{aligned} &\mathop {\min }\limits_{w,b,\xi } \frac{1}{2}{\left\| w \right\|^2} + C\sum\limits_{j = 1}^N {{\xi _j}}\\ &{\rm{s.t.}}\;\;{y_j}({w^{\rm{T}}}{x_j} + b) + {\xi _j} - 1 \geqslant 0,\;\;{\xi _j} \geqslant 0,\;\;j = 1,2, \cdots, N \end{aligned}\right.$ (3)

 $\sum\limits_{i = 1}^N {\alpha _i^*{y_i}(x \cdot {x_i})} + b = 0$ (4)

 $f(x) = {\rm{sign}}\left( {\sum\limits_{i = 1}^N {\alpha _i^*{y_i}(x \cdot {x_i})} + b} \right)$ (5)
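Equation (5) classifies a new point by the sign of a weighted sum of its inner products with the training points. A minimal linear-kernel sketch (the multipliers `alphas`, labels `ys`, training points `xs`, and bias `b` are assumed to come from an already-trained SVM; names are illustrative):

```python
def svm_decision(x, alphas, ys, xs, b):
    """Dual-form SVM decision function of Eq. (5):
    f(x) = sign( sum_i alpha_i* y_i <x, x_i> + b ), linear kernel."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    s = sum(a * y * dot(x, xi) for a, y, xi in zip(alphas, ys, xs)) + b
    return 1 if s >= 0 else -1
```

Only support vectors have nonzero multipliers, so in practice the sum runs over a small subset of the training data.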
2.3 Hybrid Feature Selection Algorithm

1. Compute the relevance $R_f$ between each feature and the class label using the chi-square test

2. Select the set $X_n$ of the $n$ most important features according to $R_f$

3. Standardize $X_n$ to obtain the standardized feature subset $SX_n$

4. Apply SBS-SVM to $SX_n$ to select the optimal feature subset $X^-$
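Step 1, the filter stage, can be sketched as a plain chi-square statistic over the contingency table of a discrete (or discretized) feature versus the class label; a higher score means stronger feature-class relevance. This is a hedged stand-in for the paper's filter, with illustrative names:

```python
from collections import Counter

def chi2_score(feature, labels):
    """Chi-square statistic between one discrete feature and the class label.

    score = sum over cells of (observed - expected)^2 / expected,
    where expected counts assume feature and label are independent.
    """
    n = len(labels)
    obs = Counter(zip(feature, labels))   # observed joint counts
    f_marg = Counter(feature)             # feature value marginals
    l_marg = Counter(labels)              # class label marginals
    score = 0.0
    for f in f_marg:
        for l in l_marg:
            expected = f_marg[f] * l_marg[l] / n
            score += (obs.get((f, l), 0) - expected) ** 2 / expected
    return score
```

Ranking features by this score and keeping the top $n$ yields the important feature set of step 2.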

In the SBS-SVM algorithm, features are removed from $SX_n$ one at a time; the SMO algorithm is used to classify on the current feature set and evaluate its classification performance. This yields one or more optimal feature subsets, whose actual classification performance is discussed in the experimental section.

SBS-SVM algorithm

1. 　$d = |SX_n|$

2. 　$X^- = SX_n$

3. 　$err_d = \infty$

4. 　while $d > k$

5. 　　for $c$ in $comb(X^-, d - 1)$

6. 　　　$res = SMO(Ds, c)$

7. 　　　$err^* = 1 - Accuracy(res)$

8. 　　　if $err^* < err_d$ then

9. 　　　　$err_d = err^*$

10. 　　　　$X_d^- = c$

11. 　　　end if

12. 　　end for

13. 　　$X^- = X_d^-$

14. 　　$d = d - 1$

15. 　end while
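The listing above translates almost line for line into Python. In this sketch the SMO-trained SVM is replaced by a caller-supplied accuracy function, and the best-error tracker is re-initialized at each subset size so a winning candidate is always chosen; this is a hedged illustration, not the authors' implementation:

```python
from itertools import combinations

def sbs(features, accuracy, k):
    """Sequential backward selection: shrink the feature set until k remain.

    `accuracy` maps a tuple of feature indices to a classification accuracy
    (in the paper, an SVM trained with SMO on those features).
    """
    best = tuple(features)
    while len(best) > k:
        err_d, best_d = float("inf"), None
        for c in combinations(best, len(best) - 1):  # every subset one smaller
            err = 1.0 - accuracy(c)                  # classification error
            if err < err_d:
                err_d, best_d = err, c
        best = best_d                                # keep the best candidate
    return best
```

With $d$ starting at $|SX_n|$, each pass evaluates $d$ candidate subsets, so the whole run trains $O(d^2)$ classifiers, which is why the chi-square filter stage is applied first to shrink $d$.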

3 Experiments

3.1 Evaluation Metrics

 $accuracy(y, y') = \frac{1}{{{n_{samples}}}}\sum\limits_{i = 0}^{{n_{samples}} - 1} {1({y_i}' = {y_i})}$ (6)

The F1 score is defined as in Eq. (7):

 $F1 = \frac{{2 \times P \times R}}{{P + R}} = \frac{{2 \times TP}}{{2 \times TP + FP + FN}}$ (7)

 $microF1 = \frac{{2 \times microP \times microR}}{{microP + microR}}$ (8)

 $Dr = 1 - (SF/AF)$ (9)
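The metrics of Eqs. (6), (7), and (9) follow directly from their definitions; micro-F1 (Eq. (8)) is the same F1 formula applied to TP/FP/FN totals pooled over all classes. A minimal sketch with illustrative names:

```python
def accuracy(y_true, y_pred):
    """Eq. (6): fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Eq. (7): F1 = 2*TP / (2*TP + FP + FN) for the given positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

def dim_reduction_rate(selected, total):
    """Eq. (9): Dr = 1 - SF/AF, the fraction of features removed."""
    return 1 - selected / total
```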

3.2 Feature Filtering Experiments

3.3 Experiments on the Important Feature Subsets

 Fig. 2 Highest accuracy on the Iris data set

 Fig. 3 Highest accuracy on the Wine feature subset combinations

 Fig. 4 Highest accuracy on the WDBC feature subset combinations

3.4 Experiments on the Optimal Feature Subsets

 Fig. 5 Experiments on the optimal Wine feature subset

 Fig. 6 Experiments on the optimal WDBC feature subset

3.5 Effect of the Slack Variable in SBS-SVM

In the SBS-SVM algorithm, the choice of the slack variable also affects the experimental results. This section examines, on two data sets, how the F1 score varies with the slack variable; 7 features are selected on the Wine data set and 12 on the WDBC data set.

 Fig. 7 Slack variable experiments

4 Conclusion
