Computer Systems & Applications, 2018, Vol. 27, Issue (10): 146–153

Robust Metric Learning Based on Maximum Correntropy Criterion
XIE Lin-Jiang, YIN Dong
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
Abstract: A metric, also called a distance function, is a special function on a metric space that satisfies certain conditions; it generally reflects important distance relationships between data examples. Since distance strongly affects many classification and clustering problems, metric learning matters for these machine learning tasks. Existing metric learning algorithms for classification are vulnerable to noise: their classification accuracy is unstable and tends to fluctuate. To address this problem, this paper presents a robust metric learning algorithm based on the maximum correntropy criterion. The core of the maximum correntropy criterion is the Gaussian kernel function, which this study introduces into metric learning: we construct a loss function from the Gaussian kernel and optimize the objective function with gradient descent, computing the output metric matrix through repeated testing and parameter adjustment. A metric matrix learned this way is more robust and effectively improves classification accuracy on classification problems affected by noise. We perform validation experiments on popular machine learning (UCI) datasets and on face datasets.
Key words: metric learning; noise; maximum correntropy criterion; Gaussian kernel function; robustness

1 Related Theory

1.1 Metric Learning

A metric (distance function) $D$ on a set of vectors must satisfy the following four conditions:

(1) $D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) + D\left( {\overrightarrow {{x_j}} ,\overrightarrow {{x_k}} } \right) \geqslant D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_k}} } \right)$

(2) $D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) \geqslant 0$

(3) $D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = D\left( {\overrightarrow {{x_j}} ,\overrightarrow {{x_i}} } \right)$

(4) $D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = 0$ if and only if $\overrightarrow {{x_i}} = \overrightarrow {{x_j}}$

 $D\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = \left\| {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right\|_2^2$ (1)

 ${D_L}\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = \left\| {L\left( {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right)} \right\|_2^2$ (2)

 ${D_L}\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = {\left( {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right)^{\rm{T}}}{L^{\rm{T}}}L\left( {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right)$ (3)

 $M = {L^{\rm{T}}}L$ (4)

 ${D_M}\left( {\overrightarrow {{x_i}} ,\overrightarrow {{x_j}} } \right) = {\left( {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right)^{\rm{T}}}M\left( {\overrightarrow {{x_i}} - \overrightarrow {{x_j}} } \right)$ (5)
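As an illustration of Eqs. (2)–(5), the following sketch (using an arbitrary example transform $L$, not a learned one) checks that the two formulations of the generalized distance agree:

```python
import numpy as np

# Sketch of the generalized (Mahalanobis) distance of Eqs. (2)-(5):
# D_L(xi, xj) = ||L (xi - xj)||^2 = (xi - xj)^T M (xi - xj), with M = L^T L.
# The matrix L here is an illustrative transform, not a learned metric.

def mahalanobis_sq(xi, xj, M):
    """Squared generalized distance (xi - xj)^T M (xi - xj), Eq. (5)."""
    d = xi - xj
    return float(d @ M @ d)

L = np.array([[2.0, 0.0],
              [1.0, 1.0]])        # example linear transform
M = L.T @ L                       # Eq. (4): M = L^T L, positive semidefinite

xi = np.array([1.0, 2.0])
xj = np.array([0.0, 0.0])

# Eqs. (2) and (5) give the same value.
via_L = float(np.sum((L @ (xi - xj)) ** 2))
via_M = mahalanobis_sq(xi, xj, M)
print(via_L, via_M)               # both equal 13.0 for this example
```

Learning the factor $L$ rather than $M$ directly keeps $M$ positive semidefinite by construction.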

1.2 Correntropy

 ${V_\sigma }\left( {A,B} \right) = E\left[ {{k_\sigma }\left( {A - B} \right)} \right]$ (6)

 ${V_{m,\sigma }}\left( {A,B} \right) = \frac{1}{m}\sum\limits_{j = 1}^m {{k_\sigma }} \left( {{A_j} - {B_j}} \right)$ (7)

 \begin{aligned} CIM\left( {A,B} \right) & = {\left( {g\left( 0 \right) - \frac{1}{m}\sum\limits_{j = 1}^m {g\left( {{e_j}} \right)} } \right)^{\frac{1}{2}}} \\& = {\left( {g\left( 0 \right) - \frac{1}{m}\sum\limits_{j = 1}^m {g\left( {{a_j} - {b_j}} \right)} } \right)^{\frac{1}{2}}} \\ \end{aligned} (8)

 $\mathop {\max }\limits_\theta \frac{1}{m}\sum\limits_{j = 1}^m {g\left( {{e_j}} \right)}$ (9)
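The sample estimator (7) and the induced metric CIM of Eq. (8) can be sketched as follows; the vectors A and B are illustrative:

```python
import numpy as np

# Sketch of the sample correntropy estimator, Eq. (7), and the
# correntropy-induced metric CIM, Eq. (8), with the Gaussian kernel.

def gauss(x, sigma):
    """Gaussian kernel k_sigma / g used throughout the paper."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def correntropy(A, B, sigma=1.0):
    """V_{m,sigma}(A, B) = (1/m) sum_j k_sigma(A_j - B_j), Eq. (7)."""
    return float(np.mean(gauss(A - B, sigma)))

def cim(A, B, sigma=1.0):
    """CIM(A, B) = (g(0) - (1/m) sum_j g(e_j))^(1/2), Eq. (8)."""
    return float(np.sqrt(gauss(0.0, sigma) - correntropy(A, B, sigma)))

A = np.array([0.0, 1.0, 2.0])     # illustrative data
B = np.array([0.1, 0.9, 2.2])
print(correntropy(A, B), cim(A, B))
```

Note that CIM vanishes exactly when the two vectors coincide, matching the identity condition of a metric.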

2 Proposed Method

2.1 Loss Function

 $X = A - B$ (10)

 $\mathop {\max }\limits_L \left( {\frac{1}{n}\sum\limits_{t = 1}^m {{y_t}\sum\limits_{i = 1}^n {g\left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right)} } } \right)$ (11)

 $\mathop {\max }\limits_L \left( {\frac{1}{n}\sum\limits_{t = 1}^m {{y_t}\sum\limits_{i = 1}^n {g\left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right)} } - \lambda \left\| L \right\|_F^2} \right)$ (12)

 $\mathop {\min }\limits_L \left( { - \frac{1}{n}\sum\limits_{t = 1}^m {{y_t}\sum\limits_{i = 1}^n {g\left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right)} } + \lambda \left\| L \right\|_F^2} \right)$ (13)

$\left( {{x_t},{y_t}} \right)$ denotes the processed sample set, where ${x_t}$ is the difference between two individuals. When the two individuals belong to the same class, ${y_t} = 1$; when they belong to different classes, ${y_t} = -1$.

 $g\left( x \right) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left( { - \frac{{{x^2}}}{{2{\sigma ^2}}}} \right)$ (14)
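Putting Eqs. (13) and (14) together, the loss over processed pairs can be sketched as below; L, X, and y are illustrative placeholders rather than values from the paper:

```python
import numpy as np

# Minimal sketch of the loss in Eq. (13): pairs (x_t, y_t), where x_t is
# the difference of two individuals and y_t = +1 (same class) or -1
# (different class); g is the Gaussian kernel of Eq. (14).

def g(x, sigma):
    """Gaussian kernel g of Eq. (14)."""
    return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def loss(L, X, y, sigma=1.0, lam=0.1):
    """Eq. (13): -(1/n) sum_t y_t sum_i g(L_i . x_t^T) + lambda ||L||_F^2."""
    n = L.shape[0]
    proj = X @ L.T                      # proj[t, i] = L_i . x_t^T
    return float(-(1.0 / n) * np.sum(y[:, None] * g(proj, sigma))
                 + lam * np.linalg.norm(L, 'fro') ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))             # six illustrative difference vectors
y = np.array([1, 1, 1, -1, -1, -1.0])   # illustrative pair labels
L0 = np.eye(3)
print(loss(L0, X, y))
```

Minimizing this loss pushes same-class differences toward zero under $L$ (large kernel value) and different-class differences away from zero.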
2.2 Optimization Method

(1) Compute the gradient of the loss function at the current position; for ${\theta _i}$, the gradient is:

 ${{\partial f\left( {{\theta _1},{\theta _2}, \cdots ,{\theta _n}} \right)}/{\partial {\theta _i}}}$

(2) Multiply the gradient by the step size $a$ to obtain the descent distance at the current position, i.e., $a \cdot {{\partial f\left( {{\theta _1},{\theta _2}, \cdots ,{\theta _n}} \right)}/{\partial {\theta _i}}}$.

(3) Check whether the descent distance is smaller than $\varepsilon$ for every ${\theta _i}$; if so, the algorithm terminates and the current ${\theta _i}\left( {i = 1,2, \cdots ,n} \right)$ are the final result. Otherwise, go to step (4).

(4) Update every $\theta$; for ${\theta _i}$, the update rule is given below. After updating, return to step (1).

 ${\theta _i} = {\theta _i} - a \cdot {{\partial f\left( {{\theta _1},{\theta _2}, \cdots ,{\theta _n}} \right)}/{\partial {\theta _i}}}$

 $\mathop {\min }\limits_L \left( { - \frac{1}{{\sqrt {2\pi } \sigma }}\frac{1}{n}\sum\limits_{t = 1}^m {{y_t}\sum\limits_{i = 1}^n {\exp \left( { - \frac{{{{\left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right)}^2}}}{{2{\sigma ^2}}}} \right)} } + \lambda \left\| L \right\|_F^2} \right)$ (15)

 $\frac{1}{{\sqrt {2\pi } {\sigma ^3}}}\frac{1}{n}\sum\limits_{t = 1}^m {\overrightarrow {{x_t}} } \left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right){y_t}\exp \left( { - \frac{{{{\left( {\overrightarrow {{L_i}} \cdot {{\overrightarrow {{x_t}} }^{\rm{T}}}} \right)}^2}}}{{2{\sigma ^2}}}} \right) + 2\lambda \overrightarrow {{L_i}}$ (16)
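The gradient expression (16) can be checked numerically against the loss of Eq. (15) (taken here with the Gaussian-kernel prefactor of Eq. (14)); the data and matrix below are illustrative:

```python
import numpy as np

def loss(L, X, y, sigma, lam):
    """Eq. (15), with the Gaussian prefactor of Eq. (14)."""
    proj = X @ L.T                          # proj[t, i] = L_i . x_t^T
    g = np.exp(-proj**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -(1.0 / L.shape[0]) * np.sum(y[:, None] * g) + lam * np.sum(L**2)

def grad(L, X, y, sigma, lam):
    """Eq. (16), stacked over all rows L_i of L."""
    proj = X @ L.T
    w = y[:, None] * proj * np.exp(-proj**2 / (2 * sigma**2))
    return (w.T @ X) / (np.sqrt(2 * np.pi) * sigma**3 * L.shape[0]) + 2 * lam * L

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                 # illustrative difference vectors
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])   # illustrative pair labels
L = np.eye(3)
sigma, lam, eps = 1.0, 0.1, 1e-6

G = grad(L, X, y, sigma, lam)
E = np.zeros_like(L)
E[0, 1] = eps
fd = (loss(L + E, X, y, sigma, lam) - loss(L - E, X, y, sigma, lam)) / (2 * eps)
print(abs(fd - G[0, 1]))                    # central difference matches Eq. (16)
```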
2.3 Why the Model Is Robust

 Fig. 1 Gaussian function

 ${K_\sigma }\left( {{x_1},{x_2}} \right) = \frac{1}{{\sqrt {2\pi } \sigma }}\exp \left( { - \frac{{{{\left\| {{x_1} - {x_2}} \right\|}^2}}}{{2{\sigma ^2}}}} \right)$ (17)

2.4 Algorithm Design

1) Convert the samples into sample pairs $\left( {{x_j} - {x_k},{y_t}} \right)$; when the two samples in a pair belong to the same class, ${y_t} = 1$; when they belong to different classes, ${y_t} = -1$.

2) Initialize L as the identity matrix I, and compute the initial loss value loss according to Eq. (15).

3) Choose a learning rate r and perform gradient descent, obtaining a new metric matrix L_new and a loss value loss1.

4) Perform gradient descent again, obtaining a new metric matrix and a loss value loss2.

5) Repeat steps 3) and 4) until ${{\left| {loss1 - loss2} \right|}/{\left| {loss2 - loss} \right|}} < \varepsilon$.

6) The final L_new is the required output L.
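The six steps above can be sketched end to end; the toy data and the hyperparameters sigma, lam, r, eps below are illustrative placeholders, not the paper's tuned values:

```python
import numpy as np

def g_kernel(u, sigma):
    """Gaussian kernel g of Eq. (14)."""
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def loss_fn(L, X, y, sigma, lam):
    """Loss of Eq. (15), with the Gaussian prefactor of Eq. (14)."""
    proj = X @ L.T
    return float(-(1.0 / L.shape[0]) * np.sum(y[:, None] * g_kernel(proj, sigma))
                 + lam * np.sum(L**2))

def grad_fn(L, X, y, sigma, lam):
    """Gradient of Eq. (16), stacked over the rows of L."""
    proj = X @ L.T
    w = y[:, None] * proj * np.exp(-proj**2 / (2 * sigma**2))
    return (w.T @ X) / (np.sqrt(2 * np.pi) * sigma**3 * L.shape[0]) + 2 * lam * L

# Step 1): convert a toy labeled set into pairs (x_j - x_k, y_t).
rng = np.random.default_rng(2)
samples = rng.normal(size=(8, 3))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X, y = [], []
for j in range(len(samples)):
    for k in range(j + 1, len(samples)):
        X.append(samples[j] - samples[k])
        y.append(1.0 if labels[j] == labels[k] else -1.0)
X, y = np.array(X), np.array(y)

sigma, lam, r, eps = 1.0, 0.01, 0.05, 1e-4   # placeholder hyperparameters

# Step 2): initialize L = I and compute the initial loss value.
L = np.eye(3)
loss = loss_fn(L, X, y, sigma, lam)

# Steps 3)-5): gradient descent until the relative loss change is below eps.
loss1 = loss
for _ in range(300):
    L = L - r * grad_fn(L, X, y, sigma, lam)    # one descent step
    loss2 = loss_fn(L, X, y, sigma, lam)
    if abs(loss1 - loss2) / (abs(loss2 - loss) + 1e-12) < eps:
        break
    loss1 = loss2

# Step 6): L is the learned factor; the output metric matrix is M = L^T L.
M = L.T @ L
print(M.shape)
```

By construction, the returned M is symmetric positive semidefinite, so it always defines a valid (pseudo-)metric.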

3 Experimental Results and Analysis

3.1 Parameter Settings

The regularization coefficient $\lambda$ controls the complexity of the model. The Gaussian kernel parameter $\sigma$ controls how sensitive the loss function is to changes in distance. The learning rate r determines how long the model takes to fit: a good rate lets the model converge quickly without overfitting or looping forever. The convergence threshold $\varepsilon$ sets a suitable stopping point so that the fitted model converges well and handles the classification task better.

3.2 Comparison Algorithms and Their Parameter Settings

ITML is the information-theoretic metric learning algorithm proposed by Davis JV et al. in [12]. Its idea is to minimize the differential relative entropy (KL divergence) between two multivariate Gaussians under distance-function constraints, and then express and solve the resulting problem as a particular Bregman optimization problem. The parameter to set is the slack coefficient: in the experiments it is varied over powers of 10, the best value is selected on the validation set, and that value is then applied to the test set. Since ITML's results vary from run to run, each experiment is repeated 10 times and the results are averaged.

LMNN is the classic metric learning algorithm proposed by Weinberger KQ et al. in [15]. Its core idea is to train the metric so that the k nearest neighbors of each example always belong to the same class while examples of different classes are separated by a large margin. The method also handles multi-class problems without modification or extension. Its parameters are learned directly on the validation set and then applied to the test set. Since LMNN's results also vary from run to run, each experiment is likewise repeated 10 times and the results are averaged.

RDML is the metric learning algorithm proposed by Jin R et al. in [18]. It shows that, under appropriate constraints, regularized distance metric learning can be independent of the dimension, making it suitable for high-dimensional data. The parameter to set is the regularization coefficient: in the experiments it is varied over powers of 10, the best value is selected on the validation set, and that value is then applied to the test set.
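The validation-based selection procedure described above (vary a coefficient over powers of 10, keep the value with the best validation score) can be sketched generically; `evaluate` is a hypothetical stand-in for training a model with the given coefficient and scoring it on the validation set:

```python
def select_power_of_ten(evaluate, exponents=range(-3, 4)):
    """Return the coefficient 10**k with the highest validation score.

    `evaluate` is a caller-supplied function mapping a coefficient to a
    validation score (higher is better); it is hypothetical here.
    """
    best_coeff, best_score = None, float('-inf')
    for k in exponents:
        coeff = 10.0 ** k
        score = evaluate(coeff)
        if score > best_score:
            best_coeff, best_score = coeff, score
    return best_coeff

# Usage with a toy evaluation function that peaks at coefficient 0.1:
best = select_power_of_ten(lambda c: -abs(c - 0.1))
print(best)
```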

3.3 Results on the UCI Benchmark Datasets

 Fig. 2 Classification accuracy vs. percentage of noisy features

 Fig. 3 Classification accuracy vs. percentage of noisy samples

3.4 Experiments on the YALEB Face Dataset

 Fig. 4 Face classification accuracy vs. percentage of noisy face features

4 Conclusion

References

[1] Zhou ZH. Machine Learning. Beijing: Tsinghua University Press, 2016. (in Chinese)
[2] Li H. Statistical Learning Methods. Beijing: Tsinghua University Press, 2012. (in Chinese)
[3] Xie JB, Xing JL, Zhang LN, et al. 20 Lectures on Visual Machine Learning. Beijing: Tsinghua University Press, 2015. (in Chinese)
[4] Yang L. Distance metric learning: A comprehensive survey [Thesis]. Michigan: Michigan State University, 2006.
[5] He R, Zheng WS, Hu BG. Maximum correntropy criterion for robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(8): 1561–1576. DOI:10.1109/TPAMI.2010.220
[6] Zhao W, Chellappa R, Phillips PJ, et al. Face recognition: A literature survey. ACM Computing Surveys, 2003, 35(4): 399–458. DOI:10.1145/954339
[7] Shen YY, Yan Y, Wang HZ. Progress of research on supervised distance metric learning algorithms. Acta Automatica Sinica, 2014, 40(12): 2673–2686. (in Chinese)
[8] Zhan Y, Jin Y, Yang F. Research on supervised distance metric learning methods. Information Technology, 2011(12): 21–23. DOI:10.3969/j.issn.1009-2552.2011.12.006 (in Chinese)
[9] Saul LK, Roweis ST. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 2003, 4: 119–155.
[10] Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500): 2319–2323. DOI:10.1126/science.290.5500.2319
[11] Belkin M, Niyogi P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS'01). Vancouver, British Columbia, Canada. 2001. 585–591.
[12] Davis JV, Kulis B, Jain P, et al. Information-theoretic metric learning. Proceedings of the 24th International Conference on Machine Learning. Corvallis, OR, USA. 2007. 209–216.
[13] Globerson A, Roweis ST. Metric learning by collapsing classes. In: Weiss Y, Schölkopf B, Platt J, eds. Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2006. 451–458.
[14] Goldberger J, Roweis S, Hinton G, et al. Neighbourhood components analysis. In: Saul LK, Weiss Y, Bottou L, eds. Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2005. 513–520.
[15] Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 2009, 10: 207–244.
[16] Tsang IW, Cheung PM, Kwok JT. Kernel relevant component analysis for distance metric learning. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks. Montreal, QC, Canada. 2005. 954–959.
[17] Kim TK, Kittler J. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(3): 318–327. DOI:10.1109/TPAMI.2005.58
[18] Jin R, Wang SJ, Zhou Y. Regularized distance metric learning: Theory and algorithm. In: Bengio Y, Schuurmans D, Lafferty JD, et al., eds. Advances in Neural Information Processing Systems. Bethesda, MD, USA: MIT Press, 2009.
[19] Erdogmus D, Principe JC. An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Transactions on Signal Processing, 2002, 50(7): 1780–1786. DOI:10.1109/TSP.2002.1011217
[20] Liu WF, Pokharel PP, Principe JC. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Transactions on Signal Processing, 2007, 55(11): 5286–5298. DOI:10.1109/TSP.2007.896065
[21] Shalev-Shwartz S, Singer Y, Ng AY. Online and batch learning of pseudo-metrics. Proceedings of the Twenty-First International Conference on Machine Learning. Banff, Alberta, Canada. 2004. 94.
[22] Bousquet O, Elisseeff A. Stability and generalization. Journal of Machine Learning Research, 2002, 2: 499–526.