Heart Disease Prediction Based on Clustering and XGboost
Computer Systems & Applications, 2019, Vol. 28, Issue (1): 228-232

LIU Yu1, QIAO Mu2
1. Nanjing FiberHome World Communication Technology Co. Ltd., Nanjing 210019, China;
2. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China
Abstract: Over the past decade, the incidence of heart disease has risen and remains high worldwide. If physical examination indicators can be extracted computationally, and the influence and weight of different features on heart disease can be analyzed with machine learning, this can play a key role in predicting and preventing heart disease. This study therefore proposes a prediction method based on clustering and the XGboost algorithm. The data are first preprocessed and the features distinguished by type, the data set is then clustered with a clustering algorithm such as K-means, and finally the XGboost algorithm is used for prediction and analysis. The experimental results show that the proposed method is feasible and effective, and that it can provide accurate and effective support for medical recommendation applications.
Key words: heart disease prediction     clustering     machine learning     K-means     extreme gradient boosting

1 Methods

1.1 Data Preprocessing

(1) Binary-valued data

(2) Multi-valued data

(3) Data with missing values removed
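The three preprocessing cases listed above can be illustrated with a minimal sketch. The source names only the three categories, so the concrete choices here are assumptions: binary fields are mapped to 0/1, multi-valued fields are one-hot encoded, and records with any missing value are dropped. The field names `sex`, `chest_pain`, and `chol` and the helper `preprocess` are hypothetical.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"sex": "male", "chest_pain": "typical", "chol": 233},
    {"sex": "female", "chest_pain": "atypical", "chol": None},  # missing value
    {"sex": "female", "chest_pain": "none", "chol": 204},
]

def preprocess(rows, binary_maps, multi_fields):
    """Drop incomplete rows, map binary fields to 0/1, one-hot encode multi-valued fields."""
    # (3) Remove rows that contain any missing value.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    # Collect the levels of each multi-valued field in a stable (sorted) order.
    levels = {f: sorted({r[f] for r in complete}) for f in multi_fields}
    encoded = []
    for r in complete:
        out = {}
        for field, value in r.items():
            if field in binary_maps:        # (1) binary field -> 0 or 1
                out[field] = binary_maps[field][value]
            elif field in multi_fields:     # (2) multi-valued field -> one-hot columns
                for level in levels[field]:
                    out[f"{field}_{level}"] = 1 if value == level else 0
            else:                           # numeric field kept unchanged
                out[field] = value
        encoded.append(out)
    return encoded

clean = preprocess(records, {"sex": {"female": 0, "male": 1}}, ["chest_pain"])
```

One-hot encoding is used for the multi-valued case because it avoids imposing an artificial order on the categories; an ordinal encoding would be an equally plausible reading of the source.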

1.2 Clustering Algorithm

The basic idea of the K-means algorithm [4] is as follows:

 $E = \sum\limits_{k = 1}^{K} {\sum\limits_{x \in {c_k}} {d{{\left( {x,{m_k}} \right)}^2}} }$ (1)

 $d\left( {x,{m_k}} \right) = \left\| {x - {m_k}} \right\| = \sqrt {{{\left( {{x_1} - {m_{k1}}} \right)}^2} + {{\left( {{x_2} - {m_{k2}}} \right)}^2} + \cdots + {{\left( {{x_n} - {m_{kn}}} \right)}^2}}$ (2)
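The criterion and distance above can be made concrete with a small sketch of the standard K-means iteration (assign each sample to its nearest centre, then recompute each centre as the mean of its members). This is a plain-Python illustration under assumed toy data, not the authors' implementation; the function names and the initial centres are hypothetical.

```python
import math

def distance(x, m):
    """Euclidean distance between a sample x and a cluster centre m, as in Eq. (2)."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def assign(points, centres):
    """Assign each point to the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda k: distance(p, centres[k]))
            for p in points]

def update(points, labels, centres):
    """Recompute each centre as the mean of its members (empty clusters keep their centre)."""
    new = []
    for k, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new.append(list(old))
    return new

def sse(points, labels, centres):
    """Sum of squared distances to the assigned centres, the criterion E of Eq. (1)."""
    return sum(distance(p, centres[label]) ** 2 for p, label in zip(points, labels))

def kmeans(points, centres, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centres)
        if new_labels == labels:
            break
        labels = new_labels
        centres = update(points, labels, centres)
    return labels, centres

# Toy data with two obvious groups; the starting centres are arbitrary.
points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centres = kmeans(points, centres=[[0.0, 0.0], [10.0, 10.0]])
error = sse(points, labels, centres)
```

Each assignment or update step can only lower E or leave it unchanged, which is why the loop stops once the labels no longer change.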

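1.3 XGboost Algorithm

XGBoost is an improved GBDT algorithm. GBDT is a boosting algorithm proposed by Friedman et al. in 2001. It is an iterative decision tree algorithm composed of multiple decision trees, whose outputs are summed to give the final answer [6]. XGBoost nevertheless differs from GBDT in important ways. GBDT uses only the first derivative during optimization, whereas XGBoost uses both the first and second derivatives; it also adds the complexity of the tree model to the objective function as a regularization term to avoid overfitting [7].

The objective function of the XGBoost algorithm is: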

 $J\left( {{f_t}} \right) = \sum\limits_{i = 1}^n {L\left( {{y_i},\hat y_i^{t - 1} + {f_t}\left( {{x_i}} \right)} \right) + \Omega \left( {{f_t}} \right) + C}$ (3)

 $f\left( {x + \Delta x} \right) \approx f\left( x \right) + f'\left( x \right)\Delta x + \frac{1}{2}f''\left( x \right)\Delta {x^2}$ (4)

 ${g_i} = \frac{{\partial L\left( {{y_i},\hat y_i^{t - 1}} \right)}}{{\partial \hat y_i^{t - 1}}},{h_i} = \frac{{{\partial ^2}L\left( {{y_i},\hat y_i^{t - 1}} \right)}}{{\partial {{\left( {\hat y_i^{t - 1}} \right)}^2}}}$ (5)
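
As an intermediate step that the source does not print, applying the second-order expansion (4) to each loss term in (3), with $x = \hat y_i^{t - 1}$ and $\Delta x = {f_t}\left( {{x_i}} \right)$, gives the following (a sketch of the standard derivation):

 $J\left( {{f_t}} \right) \approx \sum\limits_{i = 1}^n {\left[ {L\left( {{y_i},\hat y_i^{t - 1}} \right) + {g_i}{f_t}\left( {{x_i}} \right) + \frac{1}{2}{h_i}f_t^2\left( {{x_i}} \right)} \right]} + \Omega \left( {{f_t}} \right) + C$

The term $L\left( {{y_i},\hat y_i^{t - 1}} \right)$ does not depend on ${f_t}$, so it can be absorbed into the constant $C$. As an illustration under an assumed loss, the squared error $L = {\left( {{y_i} - \hat y_i} \right)^2}$ gives ${g_i} = 2\left( {\hat y_i^{t - 1} - {y_i}} \right)$ and ${h_i} = 2$.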

 $\Omega \left( {{f_t}} \right) = \gamma \cdot T + \lambda \frac{1}{2}\sum\limits_{j = 1}^T {w_j^2}$ (6)

 $J\left( {{f_t}} \right) \approx \sum\limits_{j = 1}^T {\left[ {\left( {\sum\limits_{i \in {I_j}} {{g_i}} } \right){w_j} + \frac{1}{2}\left( {\sum\limits_{i \in {I_j}} {{h_i}} + \lambda } \right)w_j^2} \right]} + \gamma \cdot T + C$ (7)

 $w_j^* = - \frac{{\displaystyle\sum\limits_{i \in {I_j}} {{g_i}} }}{{\displaystyle\sum\limits_{i \in {I_j}} {{h_i}} + \lambda }}$ (8)
 $J\left( {{f_t}} \right) = - \frac{1}{2}\displaystyle\sum\limits_{j = 1}^T {\frac{{{{\left( {\displaystyle\sum\limits_{i \in {I_j}} {{g_i}} } \right)}^2}}}{{\displaystyle\sum\limits_{i \in {I_j}} {{h_i}} + \lambda }}} + \gamma \cdot T$ (9)
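
The closed-form leaf weight of Eq. (8) and the resulting objective of Eq. (9) can be checked numerically with a short sketch. The source does not state which loss is used, so the logistic loss below is an assumption, as are the toy leaf assignments, the values of `lam` and `gamma`, and all function names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess_logistic(y, raw_pred):
    """First and second derivatives of the logistic loss w.r.t. the raw score (assumed loss)."""
    p = sigmoid(raw_pred)
    return p - y, p * (1.0 - p)

def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf weight of Eq. (8): -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def structure_score(leaves, lam, gamma):
    """Objective of Eq. (9) for a tree described by per-leaf (G, H) sums."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)

# Hypothetical toy tree with two leaves; every previous raw prediction is 0.0.
leaf_labels = [[1, 1, 0], [0, 0]]
lam, gamma = 1.0, 0.1

leaves = []
for labels in leaf_labels:
    stats = [grad_hess_logistic(y, 0.0) for y in labels]
    G = sum(g for g, _ in stats)  # sum of g_i over the leaf's sample set I_j
    H = sum(h for _, h in stats)  # sum of h_i over the leaf's sample set I_j
    leaves.append((G, H))

weights = [leaf_weight(G, H, lam) for G, H in leaves]
score = structure_score(leaves, lam, gamma)
```

A leaf dominated by positive labels receives a positive weight and a leaf of negative labels a negative one, while a larger `lam` shrinks both weights toward zero.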

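2 Model Workflow

2.1 Workflow Implementation

 Figure 1 Heart disease prediction model based on clustering and XGboost

3 Experimental Results and Analysis

3.1 Evaluation Criteria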

 $Accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}$
 $Precision = \frac{{TP}}{{TP + FP}}$
 $Recall = \frac{{TP}}{{TP + FN}}$
 ${F1} = \frac{{2 \times Precision \times Recall}}{{Precision + Recall}}$
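
The four criteria can be computed directly from confusion-matrix counts, as in the minimal sketch below. It uses the standard definitions, precision as TP / (TP + FP) and recall as TP / (TP + FN). The counts are hypothetical and are not results from the paper, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Return 0.0 when a denominator is zero (a convention assumed for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not results from the paper.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For a disease-screening task, recall is often the more critical of the two ratios, since a false negative corresponds to a missed patient.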

3.2 Comparative Analysis of Results

3.3 Important Feature Analysis

 Figure 2 Feature variable importance

4 Conclusion

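References

 [1] Zhang X, Wei ZX, Yang TS. Application of a GBDT combined model in stock prediction. Journal of Hainan Normal University (Natural Science), 2018, 31(1): 73-80.
 [2] Cai WH. Predicting ozone concentration in the air using machine learning methods. Chinese Journal of Environmental Management, 2018, 10(2): 78-84.
 [3] Hua HY, Chen QM, Liu H, et al. A network intrusion detection algorithm combining Kmeans and KNN. Computer Science, 2016, 43(3): 158-162.
 [4] Zhou LJ, Wang H, Wang WB, et al. A parallel KMeans algorithm for massive data. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2012, 40(S1): 150-152.
 [5] Wang J, Wang HH, Chen M. A heart disease prediction method based on a fuzzy clustering loop iteration model. Journal of Zhengzhou University (Natural Science Edition), 2007, 39(4): 137-140. DOI:10.3969/j.issn.1671-6841.2007.04.034
 [6] You DC, Mo Z. Research on bank credit evaluation based on the fuzzy xgboost algorithm. Information & Communications, 2018(2): 37-38. DOI:10.3969/j.issn.1673-1131.2018.02.015
 [7] Zheng KW, Yang C. Research on short-term load forecasting based on iterative decision trees (GBDT). Guizhou Electric Power Technology, 2017, 20(2): 82-84.
 [8] Xu CX, Yan JF, Yang L, et al. Context co-occurrence based relationship prediction in spatiotemporal data. Proceedings of 2018 International Conference on Computer Modeling, Simulation and Algorithm (CMSA 2018). Beijing, China. 2018. 63.
 [9] Xu ZG. Complex production process prediction model based on EMD-XGBOOST-RLSE. Proceedings of 2017 9th International Conference on Modelling, Identification and Control (ICMIC 2017). Kunming, China. 2017. 8.
 [10] Xie DQ, Zhou CJ. Application of the XGBoost algorithm based on a Bagging strategy in commodity purchase prediction. Modern Information Technology, 2017(6): 80-82. DOI:10.3969/j.issn.2096-4706.2017.06.032
 [11] Zhang Y, Chen J, Wang XF, et al. Application of Xgboost in rolling bearing fault diagnosis. Noise and Vibration Control, 2017, 37(4): 166-170, 179. DOI:10.3969/j.issn.1006-1355.2017.04.032
 [12] Yang XD, Wang JM, Zhang LN. Application of XGBoost in ultra-short-term load forecasting. Electric Drive Automation, 2017, 39(4): 21-25. DOI:10.3969/j.issn.1005-7277.2017.04.005
 [13] Zhang H, Ji HC, Zhang HY. Application of the XGBoost algorithm in e-commerce product recommendation. Internet of Things Technologies, 2017, 7(2): 102-104.
 [14] Chai LD, Xue QW, Mao N, et al. Exploring a big-data-based method for material price prediction. Power Systems and Big Data, 2017, 20(12): 13-20.
 [15] Wang WZ, Shi YL, Lyu GF, et al. Electricity consumption prediction using XGBoost based on discrete wavelet transform. Proceedings of the 2nd International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017). Wuhan, China. 2017. 14.
 [16] Ye QY, Rao H, Ji MS. Commercial sales forecasting based on Xgboost. Journal of Nanchang University (Natural Science), 2017, 41(3): 275-281. DOI:10.3969/j.issn.1006-0464.2017.03.015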