Heart Disease Prediction Based on Clustering and XGBoost
LIU Yu1, QIAO Mu2
1. Nanjing FiberHome World Communication Technology Co. Ltd., Nanjing 210019, China;
2. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China
Abstract: Over the past decade, the incidence of heart disease has continued to rise and remains high worldwide. If physical examination indicators can be extracted by computational means, and the influence and weights of different features on heart disease can be analyzed through machine learning, this would play a key role in predicting and preventing heart disease. This study therefore proposes a prediction method based on clustering and the XGBoost algorithm. After the data are preprocessed and the features are distinguished by type, the data set is clustered with a clustering algorithm such as K-means, and the XGBoost algorithm is then used for prediction and analysis. The experimental results show that the proposed method is feasible and effective, providing accurate and effective support for medical recommendation applications.
Key words: heart disease prediction; clustering; machine learning; K-means; extreme gradient boosting

1 Methods

1.1 Data preprocessing

(1) Binary-valued data

(2) Multi-valued data

(3) Data with missing values removed
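
The three preprocessing cases listed above can be illustrated with a minimal sketch. The details of each step are not given in this section, so everything concrete here is an assumption: the column names (`sex`, `chest_pain`), the category lists, the 0/1 mapping for binary features, and the choice of one indicator column per category for multi-valued features. Only the Python standard library is used.

```python
# Hypothetical feature definitions; the paper's actual feature list is not shown here.
BINARY_MAPS = {"sex": {"male": 1, "female": 0}}
MULTI_VALUED = {"chest_pain": ["typical", "atypical", "non_anginal", "asymptomatic"]}


def preprocess(records):
    """Drop incomplete records, then encode binary and multi-valued fields."""
    cleaned = []
    for row in records:
        # (3) Skip records that contain any missing value (None or empty string).
        if any(value is None or value == "" for value in row.values()):
            continue
        encoded = {}
        for name, value in row.items():
            if name in BINARY_MAPS:
                # (1) Binary-valued feature: map the two values to 0/1.
                encoded[name] = BINARY_MAPS[name][value]
            elif name in MULTI_VALUED:
                # (2) Multi-valued feature: one indicator column per category.
                for category in MULTI_VALUED[name]:
                    encoded[f"{name}_{category}"] = int(value == category)
            else:
                # Numeric features pass through unchanged.
                encoded[name] = value
        cleaned.append(encoded)
    return cleaned


# Made-up records for illustration only.
records = [
    {"age": 63, "sex": "male", "chest_pain": "typical"},
    {"age": 41, "sex": "female", "chest_pain": None},
    {"age": 57, "sex": "female", "chest_pain": "asymptomatic"},
]
rows = preprocess(records)
```

The second record is dropped because `chest_pain` is missing; the remaining two keep `age` as is and gain one 0/1 column per chest-pain category.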

1.2 Clustering algorithm

The basic idea of the K-means algorithm [4] is as follows:

 $E = \sum\limits_{k = 1}^K {\sum\limits_{x \in {c_k}} {d{{\left( {x,{m_k}} \right)}^2}} }$ (1)

 $\begin{array}{l}d\left( {x,{m_k}} \right) = \left\| {x - {m_k}} \right\|=\\ \sqrt {{{\left( {{x_1} - {m_{k1}}} \right)}^2} + {{\left( {{x_2} - {m_{k2}}} \right)}^2} + \cdots + {{\left( {{x_n} - {m_{kn}}} \right)}^2}} \end{array}$ (2)
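
To make Eqs. (1) and (2) concrete, the following is a minimal sketch of the standard Lloyd iteration for K-means, written with the Python standard library only. It is not the authors' implementation: the function names, the optional `init` argument, the handling of empty clusters, and the sample points are all assumptions made for illustration.

```python
import math
import random


def distance(x, m):
    """Euclidean distance between a point x and a cluster centre m, as in Eq. (2)."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))


def kmeans(points, k, init=None, iterations=100, seed=0):
    """Plain Lloyd iterations; returns (centres, labels, E)."""
    rng = random.Random(seed)
    centres = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: distance(p, centres[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # no centre moved, so the assignment is stable
        centres = new_centres
    # Criterion E from Eq. (1): squared distance of every point to its own centre.
    error = sum(distance(p, centres[lab]) ** 2 for p, lab in zip(points, labels))
    return centres, labels, error


# Made-up two-dimensional points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centres, labels, error = kmeans(points, k=2, init=[points[0], points[3]])
```

Passing explicit initial centres keeps the example deterministic; with random initialisation the iteration may settle in a different local minimum of E.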

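1.3 XGBoost algorithm

XGBoost is an improved GBDT algorithm. GBDT is a boosting algorithm proposed by Friedman et al. in 2001. It is an iterative decision tree algorithm composed of multiple decision trees, and the outputs of all the trees are summed to give the final answer [6]. XGBoost nevertheless differs substantially from GBDT. GBDT uses only the first derivative during optimization, whereas XGBoost uses both the first and the second derivatives; in addition, XGBoost adds the complexity of the tree model to the objective function as a regularization term to avoid overfitting [7].

The XGBoost objective function is: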

 $J\left( {{f_t}} \right) = \sum\limits_{i = 1}^n {L\left( {{y_i},\hat y_i^{t - 1} + {f_t}\left( {{x_i}} \right)} \right) + \Omega \left( {{f_t}} \right) + C}$ (3)

 $f\left( {x + \Delta x} \right) \approx f\left( x \right) + f'\left( x \right)\Delta x + \frac{1}{2}f''\left( x \right)\Delta {x^2}$ (4)

 ${g_i} = \frac{{\partial L\left( {{y_i},\hat y_i^{t - 1}} \right)}}{{\partial \hat y_i^{t - 1}}},{h_i} = \frac{{{\partial ^2}L\left( {{y_i},\hat y_i^{t - 1}} \right)}}{{\partial {{\left( {\hat y_i^{t - 1}} \right)}^2}}}$ (5)

 $\Omega \left( {{f_t}} \right) = \gamma \cdot T + \frac{1}{2}\lambda \sum\limits_{j = 1}^T {w_j^2}$ (6)

 $J\left( {{f_t}} \right) \approx \sum\limits_{j = 1}^T {\left[ {\left( {\sum\limits_{i \in {I_j}} {{g_i}} } \right){w_j} + \frac{1}{2}\left( {\sum\limits_{i \in {I_j}} {{h_i}} + \lambda } \right)w_j^2} \right]} + \gamma \cdot T + C$ (7)
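
The step from Eq. (7) to the optimal leaf weight in Eq. (8) can be spelled out as follows. The shorthand $G_j = \sum\limits_{i \in I_j} g_i$ and $H_j = \sum\limits_{i \in I_j} h_i$ is introduced here for readability and is not part of the original notation. For a fixed tree structure, Eq. (7) is a separate quadratic in each leaf weight $w_j$, so setting its derivative to zero gives

 $\frac{\partial J\left( f_t \right)}{\partial w_j} = G_j + \left( H_j + \lambda \right) w_j = 0 \quad \Rightarrow \quad w_j = - \frac{G_j}{H_j + \lambda}$

This stationary point is a minimum whenever $H_j + \lambda > 0$, which holds for a convex loss with $\lambda > 0$. Substituting it back into Eq. (7) turns each bracketed term into $- \frac{1}{2} \frac{G_j^2}{H_j + \lambda}$.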

 $w_j^* = - \frac{{\displaystyle\sum\limits_{i \in {I_j}} {{g_i}} }}{{\displaystyle\sum\limits_{i \in {I_j}} {{h_i}} + \lambda }}$ (8)

 $J\left( {{f_t}} \right) = - \frac{1}{2}\displaystyle\sum\limits_{j = 1}^T {\frac{{{{\left( {\displaystyle\sum\limits_{i \in {I_j}} {{g_i}} } \right)}^2}}}{{\displaystyle\sum\limits_{i \in {I_j}} {{h_i}} + \lambda }}} + \gamma \cdot T + C$ (9)
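
As a worked illustration of Eqs. (5), (8) and (9), the sketch below computes the gradient statistics, the optimal leaf weights, and the resulting objective value for a tiny hand-made example. Several things are assumptions rather than the paper's setup: the loss is taken to be the logistic (log) loss on raw scores, the tree structure with two leaves is hypothetical, the values of `lam` and `gamma` are arbitrary, and the constant C is left out. The helper names are invented for this example and are not part of the XGBoost library.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """g_i and h_i of Eq. (5), assuming log loss on a raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_weight(g_sum, h_sum, lam):
    """Optimal weight of one leaf, Eq. (8)."""
    return -g_sum / (h_sum + lam)


def structure_score(leaf_sums, lam, gamma):
    """Objective value of Eq. (9) for a fixed tree, with the constant C omitted."""
    quality = sum(g * g / (h + lam) for g, h in leaf_sums)
    return -0.5 * quality + gamma * len(leaf_sums)


# Made-up data: four samples, previous prediction 0.0 for all of them.
labels = [1, 1, 0, 0]
prev_scores = [0.0, 0.0, 0.0, 0.0]
leaf_of_sample = [0, 0, 1, 1]  # hypothetical tree with T = 2 leaves
lam, gamma = 1.0, 0.1

# Accumulate the per-leaf sums of g_i and h_i over the index sets I_j.
leaf_sums = [[0.0, 0.0] for _ in range(2)]
for y, score, j in zip(labels, prev_scores, leaf_of_sample):
    g, h = logistic_grad_hess(y, score)
    leaf_sums[j][0] += g
    leaf_sums[j][1] += h

weights = [leaf_weight(g_sum, h_sum, lam) for g_sum, h_sum in leaf_sums]
objective = structure_score(leaf_sums, lam, gamma)
```

In this example the leaf holding the positive samples receives a positive weight and the other leaf a negative one, so the new tree pushes the predictions toward the labels. A larger `lam` shrinks both weights toward zero, and a larger `gamma` penalises each additional leaf.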

2 Model workflow

2.1 Workflow implementation

 Fig. 1 Heart disease prediction model based on clustering and XGBoost

3 Experimental results and analysis

3.1 Evaluation criteria

 $Accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}$
 $Precision = \frac{{TP}}{{TP + FP}}$
 $Recall = \frac{{TP}}{{TP + FN}}$
 ${F1} = \frac{{2 \times Precision \times Recall}}{{Precision + Recall}}$
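
The four criteria can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions, Precision = TP / (TP + FP) and Recall = TP / (TP + FN); the function name and the counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Precision measures how many predicted positives are real, recall measures how many real positives are found, and F1 is their harmonic mean, so it is high only when both are high.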

3.2 Comparative analysis of results

3.3 Analysis of important features

 Fig. 2 Feature variable importance

4 Conclusion

