Computer Systems & Applications, 2019, Vol. 28, Issue (6): 203-208


Graduates Employment Forecasting Method Based on HMIGW Feature Selection and XGBoost
LI Qi1,2, SUN Yong2, JIAO Yan-Fei3, GAO Cen2, WANG Mei-Ji2
1. University of Chinese Academy of Sciences, Beijing 100049, China;
2. Shenyang Institute of Computer Technology, Chinese Academy of Sciences, Shenyang 110168, China;
3. Shenyang Golding NC Technology Co. Ltd., Shenyang 110168, China
Abstract: To support more effective employment guidance in colleges and universities and to train students in a more targeted manner, this study collects information on graduates and their employment outcomes, constructs a classification prediction algorithm based on HMIGW feature selection and XGBoost, and applies it to graduate employment forecasting. Considering the mixed discrete-continuous nature of student information data, the study proposes an HMIGW feature selection algorithm suited to employment prediction: it first estimates the relevance of each student-data feature, then applies a forward-addition, backward-recursive-deletion strategy to select features. Finally, the XGBoost model is trained on the selected optimal feature subset and used for prediction. Comparison with other algorithms shows that the proposed method performs better on evaluation indexes such as accuracy and training time, and it has a positive effect on the employment guidance of graduates.
Key words: graduate employment forecast     classification algorithm     feature selection

1 Key Algorithms for Graduate Employment Prediction

1.1 The HMIGW Feature Selection Algorithm

(1) Filter out redundant, irrelevant features. For each feature, compute its information measures in turn and obtain the relevance estimate Ix:

 $H\left( X \right) = - \sum\limits_{{x_i} \in X} {p\left( {{x_i}} \right)\log \left( {p\left( {{x_i}} \right)} \right)}$ (1)

 $H\left( {X,Y} \right) = - \sum\limits_{{x_i} \in X} {\sum\limits_{{y_j} \in Y} {p\left( {{x_i},{y_j}} \right)\log \left( {p\left( {{x_i},{y_j}} \right)} \right)} }$ (2)
 $H(Y|X) = - \sum\limits_{{x_i} \in X} {\sum\limits_{{y_j} \in Y} {p({x_i},{y_j})\log (p({y_j}|{x_i}))} }$ (3)

 $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$ (4)

 $I(X;Y) = \sum\limits_{{x_i} \in X} {\sum\limits_{{y_j} \in Y} {p({x_i},{y_j})} } \log \frac{{p({x_i},{y_j})}}{{p({x_i})p({y_j})}}$ (5)

 $I(X;Y) = H(X) + H(Y) - H(X,Y)$ (6)

I(X; Y) is taken as the relevance estimate Ix of feature X.
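The entropy and mutual-information measures above can be sketched for discrete features in plain Python (a minimal illustration with our own function name, not code from the paper):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Relevance estimate I(X; Y) = H(X) + H(Y) - H(X, Y) for two
    discrete variables given as equal-length sequences of samples."""
    n = len(xs)
    def entropy(counts):
        # H = -sum p log2 p over the empirical distribution
        return -sum(c / n * log2(c / n) for c in counts.values())
    h_x = entropy(Counter(xs))
    h_y = entropy(Counter(ys))
    h_xy = entropy(Counter(zip(xs, ys)))
    return h_x + h_y - h_xy

# A feature identical to the label has I(X; Y) = H(Y);
# a feature independent of the label has I(X; Y) = 0.
labels = [0, 0, 1, 1]
print(mutual_information(labels, labels))        # 1.0 for a balanced binary label
print(mutual_information([0, 1, 0, 1], labels))  # 0.0 (independent)
```

Features with a relevance estimate of zero are exactly the ones deleted in step 1) of the selection strategy below.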

(2) Feature selection with a forward-addition, backward-recursive-deletion strategy

1) For each feature xi, compute its mutual information Ii with the class label; if Ii = 0, delete xi, i.e., X = X – {xi};

2) Record the Ii values computed in the previous step as comprehensive evaluation scores, and sort the features in descending order of Ii;

3) Evaluate comprehensively with the XGBoost algorithm: apply the forward-addition strategy, i.e., a subset-search traversal of the feature space, to the feature subsets sorted in step 2), and compute the algorithm's accuracy ai on each feature subset Xi, where i denotes the number of features in the subset:

boolean flag = false;
for i = 1, …, v do
    if (ai < ai−1) then
        flag = true;
        if (amax < atmp) then
            amax = atmp; xbest = X;
        end if
        break;
    end if
end for
until flag = false (termination condition reached)
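The forward-addition pass with early stopping on an accuracy drop can be sketched as a runnable toy, in which `accuracy_of` stands in for the XGBoost evaluation of step 3) (the function and variable names are ours):

```python
def forward_select(features, accuracy_of):
    """Forward-addition pass over features already sorted by descending
    relevance I: grow the subset one feature at a time and stop at the
    first accuracy drop, keeping the best subset seen so far."""
    best_acc, best_subset = float("-inf"), []
    prev_acc = float("-inf")
    subset = []
    for f in features:
        subset = subset + [f]
        acc = accuracy_of(subset)
        if acc < prev_acc:   # ai < ai-1: adding f hurt accuracy, stop
            break
        if acc > best_acc:   # amax < atmp: record the best subset
            best_acc, best_subset = acc, subset
        prev_acc = acc
    return best_subset, best_acc

# Toy evaluator: accuracy peaks once features "a" and "b" are both included.
scores = {("a",): 0.80, ("a", "b"): 0.95, ("a", "b", "c"): 0.90}
subset, acc = forward_select(["a", "b", "c"], lambda s: scores[tuple(s)])
print(subset, acc)  # ['a', 'b'] 0.95
```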

1.2 The XGBoost Algorithm

 ${\hat y_i} = \sum\limits_{k = 1}^K {{f_k}({x_i})},\;\;\;\;{f_k} \in F$ (7)

 $Obj = L(\theta) + \Omega(\theta) = \sum\limits_{i = 1}^n {l({y_i},{{\hat y}_i})} + \sum\limits_{k = 1}^K {\Omega({f_k})},\;\;\;\;{f_k} \in F$ (8)

 $\Omega ({f_k}) = \gamma T + \frac{1}{2}\lambda ||\omega |{|^2}$ (9)

XGBoost adopts the Boosting strategy with uniform sampling; in addition, during iterative optimization it uses a second-order Taylor approximation of the objective function, so Boosting-based XGBoost achieves higher accuracy [10]. When splitting nodes, XGBoost supports multi-threaded parallel computation across features on the CPU, so it is also faster.
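The second-order Taylor idea can be illustrated numerically for squared loss, where the gradient is gi = 2(ŷi − yi) and the hessian is hi = 2, so the regularized optimal leaf weight is w* = −Σg / (Σh + λ). This is a toy sketch of the principle, not the library's internals:

```python
def leaf_weight(y_true, y_pred, lam):
    """Optimal leaf weight under the second-order approximation of the
    objective for squared loss l(y, yhat) = (y - yhat)^2:
    g_i = 2*(yhat_i - y_i), h_i = 2, w* = -sum(g) / (sum(h) + lam)."""
    g = [2 * (p - t) for t, p in zip(y_true, y_pred)]
    h = [2.0] * len(y_true)
    return -sum(g) / (sum(h) + lam)

# With current predictions all 0 and lambda = 0, the optimal weight
# recovers the mean residual; a larger lambda shrinks it toward 0.
y = [1.0, 2.0, 3.0]
print(leaf_weight(y, [0.0, 0.0, 0.0], lam=0.0))  # 2.0
print(leaf_weight(y, [0.0, 0.0, 0.0], lam=6.0))  # 1.0
```

The regularization term λ appears in the denominator, which is how Eq. (9)'s penalty shrinks leaf weights and counters overfitting.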

2 Experimental Analysis

 Figure 1. Construction of the graduate employment prediction model

2.1 Data Collection

2.2 Data Preprocessing

2.3 Feature Selection

2.4 Model Prediction

(1) Validating the effectiveness of the HMIGW algorithm

(2) Validating the effectiveness of the proposed model

2.5 Evaluation Metrics

 $precision = \frac{{{T_p}}}{{{T_p} + {F_p}}}$ (10)
 $recall = \frac{{{T_p}}}{{{T_p} + {F_n}}}$ (11)
 ${F_1} = \frac{{2 \times precision \times recall}}{{precision + recall}}$ (12)
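These three metrics can be computed directly from confusion-matrix counts (a minimal sketch; the counts below are made up for illustration):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf1(tp=90, fp=10, fn=30))  # (0.9, 0.75, 0.818...)
```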

2.6 Analysis of Experimental Results

(1) In the experiments, the CFS and WFS algorithms use Best First (BF) search and Greedy Stepwise (GS) search, respectively, for feature-subset selection, while HMIGW uses the forward-addition, backward-recursive-deletion strategy.

 Figure 2. Comparison of classification accuracy across feature selection algorithms

(2) The prediction model built with the XGBoost algorithm is evaluated on the test set. For a more intuitive comparison, the random forest algorithm is added as a baseline. A random forest builds multiple decision trees, each of which classifies a sample independently; the final classification is decided by a majority vote over the trees' individual results.
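The majority vote just described can be sketched as follows, with stub "trees" standing in for real trained classifiers:

```python
from collections import Counter

def forest_predict(trees, sample):
    """Each tree votes with its own prediction for the sample;
    the most common label wins, as in random forest classification."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three stub "trees": two predict "employed", one predicts "unemployed".
trees = [lambda s: "employed", lambda s: "employed", lambda s: "unemployed"]
print(forest_predict(trees, sample=None))  # employed
```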

 Figure 3. Performance comparison of HMIGW-based XGBoost and random forest

(3) In summary, the graduate employment prediction method adopted in this paper first computes the feature relevance estimate I, then uses classification accuracy to re-evaluate each feature's contribution weight to the prediction and obtain the optimal feature subset. This effectively reduces feature volatility without lowering prediction accuracy. The XGBoost algorithm is then applied to the dataset restricted to the optimal feature subset; through serial iterative computation it achieves higher classification accuracy, reaching 97.34% in prediction, and by using multi-threaded CPU parallelism when splitting nodes it effectively speeds up computation, keeping training time within 0.03 s, a substantial performance improvement over the other prediction methods discussed in this paper.

3 Conclusion

References
[1] 孙晓璇, 杨家娥, 李雅峰. 基于决策树ID3算法的高职生就业预测分析. 电脑编程技巧与维护, 2015(2): 15-16, 35. DOI:10.3969/j.issn.1006-4052.2015.02.005
[2] 唐燕, 王苹. 基于C4.5和随机森林算法的中医药院校毕业生就业预测应用研究. 中国医药导报, 2017, 14(24): 166-169.
[3] 吴振磊, 刘孝赵. 一种基于BP神经网络的就业分析预测模型. 轻工科技, 2016, 32(9): 70-71, 104.
[4] 朱庆生, 高璇. 应用自然邻居分类算法的大学生就业预测模型. 计算机系统应用, 2017, 26(8): 190-194. DOI:10.15888/j.cnki.csa.005906
[5] Burnasheva S, Zhuravleva I, Kustov T, et al. Creation of the effective system for students' and graduates' employment promotion at the university: ETU "LETI" experience. Proceedings of 2016 IEEE V Forum Strategic Partnership of Universities and Enterprises of Hi-Tech Branches. St. Petersburg, Russia. 2016. 72–73.
[6] Baskakova DY, Belash OY, Shestopalov MY. Graduates' employment: Expectations and reality. Proceedings of 2017 IEEE VI Forum Strategic Partnership of Universities and Enterprises of Hi-Tech Branches. St. Petersburg, Russia. 2017. 128–131.
[7] 谢晓龙, 叶笑冬, 董亚明. 梯度提升随机森林模型及其在日前出清电价预测中的应用. 计算机应用与软件, 2018, 35(9): 327-333. DOI:10.3969/j.issn.1000-386x.2018.09.058
[8] Chen TQ, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA. 2016. 785–794.
[9] 陈宇韶, 唐振军, 罗扬, 等. 皮尔森优化结合XGBoost算法的股价预测研究. 信息技术, 2018(9): 84-89.
[10] 毛莺池, 曹海, 平萍, 等. 基于最大联合条件互信息的特征选择. 计算机应用, 2019, 39(3): 734-741.