﻿ 基于差分隐私保护的模糊C均值聚类推荐
 计算机系统应用  2018, Vol. 27 Issue (10): 189-195 PDF

Fuzzy C-Means Clustering Recommendation Based on Differential Privacy Protection
JIANG Zong-Li, QIAO Xiang-Mei
Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Abstract: The users are classified by different membership degrees with fuzzy C-means clustering. A more accurate clustering effect has been obtained and the problem of low recommendation accuracy caused by hard clustering is solved. Aiming at the privacy leakage problem of recommendation algorithm, the Laplace noise is introduced into the fuzzy C-means clustering process, and the differential privacy protection based fuzzy C-means clustering recommendation is implemented. The experimental results show that the proposed algorithm can effectively improve the security of the recommended system with the good quality of the recommendation.
Key words: collaborative filtering     fuzzy C-means clustering     differential privacy

(1) 引入模糊C均值聚类算法, 解决硬聚类问题, 提高了算法的准确度.

(2) 通过给模糊C均值聚类过程添加Laplace噪声实现差分隐私保护.

(3) 在MovieLens数据集上进行实验, 验证了本算法的准确度和安全性.

1 基于差分隐私保护的模糊C均值聚类推荐

FCMC使用模糊划分, 隶属度矩阵U允许有取值在0~1之间的元素, 且通过数据归一化, 任一数据到所有聚类的隶属度总和等于1, 表示为式(1):

 $\sum\nolimits_{i = 1}^c {{u_{ij}}} = 1, \forall j = 1, 2, \cdots, n.$ (1)

FCMC的价值函数一般化形式如式(2):

 $J\left( {\mathit{{U}}, {\mathit{{H}}_\mathit{{1}}}, \cdots, {\mathit{{H}}_\mathit{{c}}}} \right) = \sum\nolimits_{i = 1}^c {\sum\nolimits_{j = 1}^n {u_{ij}^nd_{ij}^2} }$ (2)

 $\begin{split}& \widehat J\left( {U, {H_1}, \cdots, {H_c}, {\lambda _1}, \cdots, {\lambda _n}} \right)\\& = J\left( {\mathit{{U}}, {\mathit{{H}}_{\bf{1}}}, \cdots, {\mathit{{H}}_\mathit{{c}}}} \right) + \sum\limits_{j = 1}^n {{\lambda _j}\left( {\sum\limits_{i = 1}^c {{u_{ij}}- 1} } \right)} \\& = \sum\nolimits_{i = 1}^c {\sum\nolimits_{j = 1}^n {u_{ij}^md_{ij}^2} } + \sum\nolimits_{j = 1}^n {{\lambda _j}\left( {\sum\nolimits_{i = 1}^c {{u_{ij}}- 1} } \right)} \end{split}$ (3)

 ${h_i} = \sum\limits_{j = 1}^n {u_{ij}^m{x_j}} /\sum\limits_{j = 1}^n {u_{ij}^m}$ (4)
 ${u_{ij}} = 1/\sum\limits_{k = 1}^c {{{\left( {\frac{{{d_{ij}}}}{{{d_{kj}}}}} \right)}^{2/\left( {m - 1} \right)}}}$ (5)

(1) FCMC聚类过程中, 假设攻击者获取到每次迭代过程中各聚簇中心点和某个样本点的距离, 就可以通过这些数据推断出该样本点的具体属性值, 而且迭代次数越多, 数据样本属性越少, 其隐私暴露的越彻底.

(2) FCMC聚类过程中, 如果攻击者拥有最大背景知识, 即攻击者已知样本点所属的聚簇内除数据样本点以外的所有数据点和中心点, 就可以根据中心点计算公式推断出这个样本点的属性值.

 ${\rm{pr}}\left[{k\left( \mathit{{D}} \right) \in \mathit{{S}}} \right] \le {e^{\left( t \right)}} \times {\rm{pr}}\left[{k\left( {\mathit{{D'}}} \right) \in \mathit{{S}}} \right]$ (6)

 $f = \mathop {\max }\limits_{{\rm{D}}1, {\rm{D}}2} {\left\| {f\left( {{\mathit{{D}}_{\bf{1}}}} \right) - f\left( {{\mathit{{D}}_{\bf{2}}}} \right)} \right\|_1}$ (7)

(1) 初始化评分矩阵;

(2) 设置参数. 设置聚类的个数为k, 聚类停止条件参数为δ, 根据经验模糊指数m=2, 隶属度阈值为ƞ(0<ƞ<1), 最近邻参数为k, 设置推荐产品评分阈值为Ɵ;

(3) 用0~1的随机数初始化隶属矩阵U, 使其满足公式(1)的归一化条件;

(4) 根据公式(4)计算k个聚类中心d1, d2,…,dk, 对于1≤jk, 添加Laplace噪声, dj'=dj+Lap(b), 将其作为初始中心点;

(5) 根据公式(2)计算价值函数Jn(表示第n次迭代), 判断价值函数变化. 如果|Jn-Jn-1|<δ, 停止迭代, 执行(7);否则执行(6);

(6) 根据公式(5)计算新的隶属矩阵U, 执行(4);

(7) 遍历隶属矩阵U, 当用户到聚类的隶属度u>ƞ时, 将用户归为该聚类;

(8) 随机取样本用户, 根据模糊C均值聚类结果, 找到样本用户所属一个或多个聚类, 将所属聚类中的其他用户作为相似用户;

(9) 根据相似用户和样本用户的相似度排序, 得到样本用户的k个最近邻居;

(10)根据最近邻居计算样本用户对产品的评分;

(11)将评分大于Ɵ的产品推荐给样本用户.

2 实验系统的设计与实现

 图 1 系统结构图

(1) 数据预处理模块

(2) 推荐模型选取模块

(3) 推荐质量评估模块

(1) 选取合适数据集, 验证实验的有效性;

(2) 选取理想评价指标, 衡量推荐准确度;

(3) 为了使本文提出的新算法有比较理想的推荐效果, 通过给相关参数选取不同数值, 对比推荐结果的准确度, 从而选取理想参数;

(4) 为了验证新算法的有效性, 通过给图 1中系统结构的推荐模型选择本文提出的新算法和其他相关典型算法, 分别进行实验, 对比结果.

 图 2 基于差分隐私保护的模糊C均值聚类推荐流程图

 图 3 基于差分隐私保护的模糊C均值聚类处理流程

3 实验结果与分析 3.1 数据集

3.2 评价指标

(1) 均方根误差(RMSE), 越小意味着推荐越准确. 定义为:

 ${{RMSE}} = \sqrt {\sum\limits_{i = 1}^n {{{\left( {{\mathit{{p}}_\mathit{{i}}} - {\mathit{{q}}_\mathit{{i}}}} \right)}^2}/N} }$ (8)

(2) 平均绝对偏差(MAE)[5]指标通过计算预测的用户评分与实际的户评分之间的偏差度量预测的准确性, 越小意味着推荐越准确, 定义为:

 ${{MAE}} = \sum\limits_{i = 1}^n {\left| {{{{p}}_{{i}}} - {{{q}}_{{i}}}} \right|/N}$ (9)

(3) F-measure指标[25]评价推荐的质量, 由召回率和准确率将两者结合组成, 其中召回率反映待推荐项目被推荐的比率, 准确率表示算法推荐成功的比率. F-measure值越大推荐质量越高, 计算公式如下:

 $\begin{array}{l}{{F - measue}} = 2 \times recall \times prescison/recall + presison\end{array}$ (10)
3.3 实验比较和分析

 图 4 DPFCMC聚类个数对推荐质量的影响

 图 5 DPFCMC最近邻个数对推荐质量的影响

 图 6 DPFCMC与其他典型算法的RMSE值对比

 图 7 DPFCMC与其他典型算法的MAE值对比

 图 8 DPFCMC与其他典型算法的F-measure值对比

4 结束语

 [1] Su XY, Khoshgoftaar TM. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 2009: 421425. [2] Herlocker JL, Konstan JA, Terveen LG, et al. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 2004, 22(1): 5-53. [3] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 1992, 35(12): 61-70. DOI:10.1145/138859.138867 [4] 邓爱林, 朱扬勇, 施伯乐. 基于项目评分预测的协同过滤推荐算法. 软件学报, 2003, 14(9): 1621-1628. [5] 张光卫, 李德毅, 李鹏, 等. 基于云模型的协同过滤推荐算法. 软件学报, 2007, 18(10): 2403-2411. [6] Zhang F, Chang HY. A collaborative filtering algorithm embedded BP network to ameliorate sparsity issue. 2005 International Conference on Machine Learning and Cybernetics. Guangzhou, China. 2005. 1839–1844. [7] Massa P, Avesani P. Trust-aware collaborative filtering for recommender systems. In: Meersman R, Tari Z, eds. On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. Berlin, Heidelberg. Springer. 2004. 492–508. [8] Konstan JA, Riedl J, Smyth B. Proceedings of the 2007 ACM conference on Recommender systems. ACM Conference on Recommender Systems. Minneapolis, MN, USA. 2007. [9] Chowdhury M, Thomo A, Wadge WW. Trust-based infinitesimals for enhanced collaborative filtering. International Conference on Management of Data. Mysore, India. 2010. 9–12. [10] 曾春, 邢春晓, 周立柱. 个性化服务技术综述. 软件学报, 2002, 13(10): 1952-1961. [11] Borchers A, Herlocker J, Konstan J, et al. Ganging up on Information Overload. Computer, 1998, 31(4): 106-108. DOI:10.1109/2.666847 [12] Kim TH, Park SI, Yang SB. Improving prediction quality in collaborative filtering based on clustering. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008. Wi-Iat. Sydney, NSW, Australia. 2008. 704–710. [13] Berget I, Mevik BH, Næs T. New modifications and applications of fuzzy C-means methodology. Computational Statistics & Data Analysis, 2008, 52(5): 2403-2418. [14] Agrawal R, Srikant R. Privacy-preserving data mining. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. New York, NY, USA. 2000. 439–450. [15] Sweeney L. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570. DOI:10.1142/S0218488502001648 [16] Chen R, Mohammed N, Fung BCM, et al. Publishing setvalued data via differential privacy. Proceedings of the VLDB Endowment, 2012, 4(4): 1087-1098. [17] Sarathy R, Muralidhar K. Some additional insights on applying differential privacy for numeric data. In: Domingo-Ferrer J, Magkos E, eds. Privacy in Statistical Databases. Berlin: Springer, 2010, 210-219. [18] 彭慧丽, 张啸剑, 金凯忠. 基于差分隐私的社交推荐方法. 计算机科学, 2017, 44(S1): 395-398, 423. [19] 何明, 常盟盟, 吴小飞. 一种基于差分隐私保护的协同过滤推荐方法. 计算机研究与发展, 2017, 54(7): 1439-1451. [20] Li X, Lu X, Tian J, et al. Application of fuzzy c-means clustering in data analysis of metabolomics. Analytical Chemistry, 2009, 81(11): 4468-4468. DOI:10.1021/ac900353t [21] Dembélé D, Kastner P. Fuzzy C-means method for clustering microarray data. Bioinformatics, 2003, 19(8): 973-980. DOI:10.1093/bioinformatics/btg119 [22] Dwork C. Differential privacy: A survey of results. In: Agrawal M, Du D, Duan Z, et al, eds. Theory and Applications of Models of Computation. Berlin: Springer-Verlag, 2008. 1–19. [23] Dwork C, Mcsherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. Halevi S, Rabin T. Theory of Cryptography. Berlin: Springer-Verlag, 2006. 265–284. [24] 马宏伟, 张光卫, 李鹏. 协同过滤推荐算法综述. 小型微型计算机系统, 2009, 30(7): 1282-1288. [25] Goldberg K, Roeder T, Gupta D, et al. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 2001, 4(2): 133-151. DOI:10.1023/A:1011419012209