Diabetes Prediction Method Based on CatBoost Algorithm
Computer Systems & Applications, 2019, Vol. 28, Issue (9): 215-218

MIAO Feng-Shun1,2, LI Yan2, GAO Cen2, WANG Mei-Ji2, LI Dong-Mei2
1. University of Chinese Academy of Sciences, Beijing 100049, China;
2. Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang 110168, China
Abstract: In recent decades, people’s living standards have improved significantly, but health awareness remains weak. Poor lifestyle and eating habits have led to a sharp increase in the number of people with diabetes, and the complications caused by diabetes are a serious threat to health. Because the awareness rate of diabetes is low, many patients fail to detect the disease in time, which leads to complications. This study analyzes the characteristics of diabetes data, which tend to have small sample sizes and missing values. Information value (IV) analysis is used for feature selection, and CatBoost, a recent Boosting algorithm, is used to predict diabetes, achieving significant predictive performance.
Key words: diabetes; IV value analysis; feature selection; ensemble learning; CatBoost

1 Algorithm Description

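CatBoost is an implementation of the Boosting strategy. Like LightGBM and XGBoost, it belongs to the GBDT family of algorithms. CatBoost improves on GBDT mainly in two respects: it handles categorical (nominal) features and it addresses the prediction shift problem, thereby reducing overfitting.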

1.1 GBDT

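GBDT obtains a strong learner through the serial iteration of a set of classifiers, and uses it to achieve higher-accuracy classification [10]. It uses the forward stagewise algorithm, with classification and regression trees (CART) as the weak learners.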

 ${h^t} = \mathop {\arg \min }\limits_{h \in H} \mathbb{E}\,L\left( {y,{F^{t - 1}}\left( x \right) + h\left( x \right)} \right)$ (1)

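GBDT uses the negative gradient of the loss function to fit an approximation of the loss in each round; ${g^t}(x,y)$ in Eq. (2) denotes this gradient, and the negative sign is applied in Eq. (3).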

 ${g^t}(x,y) = \frac{{\partial L(y,s)}}{{\partial s}}{|_{s = {F^{t - 1}}(x)}}$ (2)

 ${h^t} = \mathop {\arg \min }\limits_{h \in H} E{( - {g^t}(x,y) - h(x))^2}$ (3)

 ${F^t}(x) = {F^{t - 1}}(x) + {h^t}(x)$ (4)
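
The iteration in Eqs. (1)-(4) can be illustrated with a minimal sketch. It assumes a squared-error loss $L(y,s) = (y - s)^2/2$, a single numeric feature, and one-split regression stumps standing in for CART; the names `fit_stump` and `gradient_boost` and the `learning_rate` parameter are hypothetical and do not come from the paper.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Least-squares regression stump on one numeric feature (a stand-in for CART)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    if best is None:  # every x is identical, so no split is possible
        constant = mean(targets)
        return lambda x: constant
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def gradient_boost(xs, ys, rounds=10, learning_rate=1.0):
    """Build F^t = F^{t-1} + h^t for squared-error loss (assumed here)."""
    base = mean(ys)                      # F^0: a constant initial model
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # Eq. (2): gradient of L(y, s) = (y - s)^2 / 2 at s = F^{t-1}(x)
        gradients = [p - y for p, y in zip(preds, ys)]
        # Eq. (3): fit h^t to the negative gradient by least squares
        h = fit_stump(xs, [-g for g in gradients])
        stumps.append(h)
        # Eq. (4): additive update; learning_rate=1.0 reproduces Eq. (4) exactly
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(h(x) for h in stumps)

    return predict
```

With `learning_rate` below 1 the update becomes a shrunken version of Eq. (4), a common practical variation that is an assumption here rather than something the text specifies.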
1.2 CatBoost

 $\frac{{\displaystyle\sum\nolimits_{j = 1}^n {[{x_{j,k}} = {x_{i,k}}] \cdot {Y_j}} + a \cdot P}}{{\displaystyle\sum\nolimits_{j = 1}^n {[{x_{j,k}} = {x_{i,k}}]} + a}}$ (5)
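
Eq. (5) replaces a categorical value with a smoothed average of the labels observed for that category. The following is a minimal sketch under stated assumptions: the sums run over all training samples, $a$ is a prior weight, and $P$ is a prior value (taken here as the mean label unless given). The function names are hypothetical. The second function shows an ordered variant that uses only the samples preceding the current one, which is the idea CatBoost applies to limit prediction shift; the fixed input order stands in for a random permutation.

```python
def target_statistic(categories, labels, prior_weight=1.0, prior=None):
    """Encode x_{i,k} as (sum of labels in its category + a*P) / (count + a)."""
    if prior is None:
        prior = sum(labels) / len(labels)  # assumption: P is the overall mean label
    sums, counts = {}, {}
    for c, y in zip(categories, labels):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [(sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
            for c in categories]


def ordered_target_statistic(categories, labels, prior_weight=1.0, prior=0.5):
    """Same statistic, but each sample only sees the samples that precede it."""
    sums, counts = {}, {}
    encoded = []
    for c, y in zip(categories, labels):
        encoded.append((sums.get(c, 0.0) + prior_weight * prior)
                       / (counts.get(c, 0) + prior_weight))
        # Update the running totals only after encoding the current sample,
        # so its own label never leaks into its encoding.
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return encoded
```

In the ordered variant the first occurrence of any category is encoded by the prior alone, since no earlier sample carries information about it.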

 Fig. 1 Ordered boosting procedure

2 Experimental Analysis

 Fig. 2 Diabetes prediction model

2.1 Data Collection

2.2 Data Preprocessing

2.3 Feature Selection

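IV (information value) analysis is a common method for assessing features; it measures how strongly a feature influences the target. The basic idea is to compare the proportions of positive and negative ("black" and "white") samples that fall under a given feature value with the proportions of positive and negative samples overall, and to compute the degree of association from that comparison. The formula is as follows: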

 $IV = \sum\limits_{i = 1}^n {({P_{yi}} - {P_{ni}}) \cdot \ln \frac{{{P_{yi}}}}{{{P_{ni}}}}}$ (6)
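
As an illustration of Eq. (6), the sketch below computes the IV of a feature that has already been discretized into bins. It assumes $P_{yi}$ is the share of all positive samples that fall in bin $i$ and $P_{ni}$ the corresponding share of negative samples. The function name and the `smoothing` parameter (an additive constant that avoids taking the logarithm of zero for empty cells) are hypothetical and not part of the paper.

```python
import math


def information_value(bins, labels, smoothing=0.5):
    """IV of a binned feature against a binary label (1 = positive, 0 = negative)."""
    pos_counts, neg_counts = {}, {}
    for b, y in zip(bins, labels):
        counts = pos_counts if y == 1 else neg_counts
        counts[b] = counts.get(b, 0) + 1
    all_bins = set(pos_counts) | set(neg_counts)
    # Smoothed totals, so the per-bin shares still sum to one.
    total_pos = sum(pos_counts.values()) + smoothing * len(all_bins)
    total_neg = sum(neg_counts.values()) + smoothing * len(all_bins)
    iv = 0.0
    for b in all_bins:
        p_y = (pos_counts.get(b, 0) + smoothing) / total_pos  # P_{yi}
        p_n = (neg_counts.get(b, 0) + smoothing) / total_neg  # P_{ni}
        iv += (p_y - p_n) * math.log(p_y / p_n)
    return iv
```

Each term of the sum is non-negative, so a feature whose bins have the same positive and negative shares scores 0, and larger values indicate a stronger association with the label. Passing `smoothing=0` gives the unsmoothed formula but fails when a bin contains no positive or no negative samples.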

2.4 Model Prediction

2.5 Evaluation Criteria

 $accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}$ (7)
 $precision = \frac{{TP}}{{TP + FP}}$ (8)
 $recall = \frac{{TP}}{{TP + FN}}$ (9)
 $F1 = \frac{{2 \cdot precision \cdot recall}}{{precision + recall}}$ (10)
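
Eqs. (7)-(10) can be checked on a small example. The sketch below counts TP, TN, FP, and FN from binary labels and predictions and then applies the four formulas. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)                           # Eq. (8)
    recall = ratio(tp, tp + fn)                              # Eq. (9)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),       # Eq. (7)
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),  # Eq. (10)
    }
```

For example, with `y_true = [1, 1, 1, 0, 0, 0, 0, 1]` and `y_pred = [1, 1, 0, 0, 0, 1, 0, 1]` there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.75.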

2.6 Experimental Results and Analysis

3 Conclusion

 [1] Bai BY, Yu Q, Su YB, et al. Collaboration analysis of diabetes research papers in China. Chinese Remedies & Clinics, 2017, 17(11): 1619-1621. (in Chinese)
 [2] Wang HP. Trend prediction of the economic burden of diagnosed diabetes in China [Ph.D. thesis]. Jinan: Shandong University, 2013. (in Chinese)
 [3] Su P, Yang YC, Yang Y, et al. Risk prediction model for type 2 diabetes incidence in a health management population. Journal of Shandong University (Health Sciences), 2017, 55(6): 82-86. DOI:10.6040/j.issn.1671-7554.0.2017.347 (in Chinese)
 [4] Luo SL, Cheng H, Zhang TM, et al. Preprocessing techniques for multi-dimensional measured data on type 2 diabetes. Computer Engineering, 2004, 30(17): 178-181. DOI:10.3969/j.issn.1000-3428.2004.17.071 (in Chinese)
 [5] Wu HY, Pan P, He Y, et al. A method for assessing the risk of diabetes incidence in Chinese adults. Chinese Journal of Health Management, 2007, 1(2): 95-98. DOI:10.3760/cma.j.issn.1674-0815.2007.02.012 (in Chinese)
 [6] Zhang HX, Guo H, Wang JX, et al. Research on a precise prediction model for type 2 diabetes based on the XGBoost algorithm. Chinese Journal of Laboratory Diagnosis, 2018, 22(3): 408-412. DOI:10.3969/j.issn.1007-4287.2018.03.008 (in Chinese)
 [7] Bottou L. Large-scale machine learning with stochastic gradient descent. Proceedings of the 19th International Conference on Computational Statistics. Paris, France. Keynote. 2010. 177-186.
 [8] Tan PN. Receiver operating characteristic. Liu L, Özsu MT. Encyclopedia of Database Systems. New York, NY, USA: Springer, 2013. 2349-2352.
 [9] Sau A, Bhakta I. Screening of anxiety and depression among the seafarers using machine learning technology. Informatics in Medicine Unlocked, 2018. DOI:10.1016/j.imu.2018.12.004
 [10] Yang T, Chen WT, Cao GT. Automated classification of neonatal amplitude-integrated EEG based on gradient boosting method. Biomedical Signal Processing and Control, 2016, 28: 50-57. DOI:10.1016/j.bspc.2016.04.004