﻿ 面向数值型敏感属性的隐私保护方案
 计算机系统应用  2019, Vol. 28 Issue (7): 184-190 PDF

Privacy Protectionscheme Scheme for Numerical Sensitive Attributes
WANG Tao, WEN Mi
School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 200090, China
Foundation item: National Natural Science Foundation of China (61572311, 61602295)
Abstract: As for that existing personalized privacy anonymous technology can not solve the problem that the numerical sensitive attribute is vulnerable to the proximity breach, an anonymous model called (εi, k)-anonymity model is proposed and the model is based on clustering technology. Firstly, the model divides the sensitive attribute values in ascending order into several sub-intervals based on the clustering method; then, it proposes an (εi, k)-anonymity principle for numerically sensitive attributes against proximity breach; finally, a maximum bucket-first algorithm is proposed to implement the (εi, k)-anonymity principle. The experimental results show that compared with the existing scheme used for resisting proximity breach, the information loss of the proposed anonymous scheme is reduced, the algorithm execution efficiency is improved and it can reduce the leakage risk of user privacy effectively.
Key words: privacypreserving     numerical sensitive attributes     proximity privacy     (εi, k)-anonymity model

1 引言

2 相关工作

3 面向数值型敏感属性(εi, k)-匿名模型

3.1 敏感属性值域区间划分算法

1)将各条记录的敏感属性值提取出来按照升序方式以此排列在数轴上.

2)计算数轴上每两个相邻点之间的相对欧氏距离 ${\dot d_{{{ij}}}}\left( {i = 1,2,\cdots, b - 1,j = i + 1} \right)$ .

3)求出平均相对欧氏距离: ${\dot {{d}}_{\text{平均}}} = \left( {\displaystyle\sum\nolimits_{i = 1}^{{{b}} - 1} {\dot d_{{ij}}} } \right)/$ $\left( {b - 1} \right)$ .

4)比较每两个敏感属性值间的相对欧氏距离 ${\dot d_{{\rm{ij}}}}$ (将第一个点和最后一个点默认为第一个值域区间的左端点和最后一个区间的右端点)与平均相对欧氏距离 ${{{\dot d}}_{\text{平均}}}$ 的大小, 如果 ${\dot d_{{{ij}}}} \le {{{\dot d}}_{\text{平均}}}$ , 则说明这两个点可以划分到同一个区间内, 这两点不作为值域区间端点, 如果 ${\dot d_{{{ij}}}} \ge {{{\dot d}}_{\text{平均}}}$ , 则说明这两点距离过大, 应在这两个点处对区间进行拆分, i点作为生成的左侧区间的右端点, j点作为生成的右侧区间的左端点.

5)当步骤4)全部完成后, 生成所有的敏感属性值域区间.

3.2 (εi, k)-匿名原则

1) 每个分组内包含的元组数|E|在区间[k, 2k]之间.

2)每个分组内至少包含kεi邻域没有交集的记录.

3.3 最大桶优先提取算法

1) 首先对原始表T中的记录根据其敏感属性值的大小进行升序排列.

2)第一次划分区间: 见3.1节敏感属性值域区间划分算法.

3)为每个值域区间设置相应的阈值εi(由于数值敏感度的影响, 值域区间内的数值越大, 相应的该区间的阈值设置越高).

4)第二次划分区间: 对每个值域区间以对应的阈值为间隔再次划分, 生成若干个桶.

5)根据桶内元组的数目, 对得到的桶按降序处理.

6)将前k个桶的第一条记录都提取出来.

7)判断: 如果这k个桶全部是非相邻的, 那么则可以直接将这k条记录放到第一个分组内. 如果这k个桶中含有相邻桶, 则需要进行判断, 如果这两个敏感属性值来自同一个值域区间, 那么比较其差值是否大于该区间设定的阈值, 如果大于则满足提取条件可以直接提取, 如果差值小于阈值, 那么值域较小桶内提取记录不变, 从值域较大桶内的第二条记录开始逐次提取, 直到满足两条记录的差值大于该区间设定的阈值. 如果无法找到这样的两条记录, 那么说明这两个桶内的数据分布过于接近, 则隐藏其中的一个桶, 提取第k+1个桶重新比较, 直到找到满足条件的k条记录为止.

8)提取结束之后, 将提取出来的记录从桶内剔除, 按桶内元组的数量将桶重新降序排序.

9)如果含有元组的桶的数量大于k个, 则重复执行6)至8)步骤, 每经过一个循环就会形成一个新的分组, 直到含有记录的桶的数目小于k个为止或者剩余k个但是提取不出满足条件的记录为止.

10)对于剩余桶内的记录, 将其插入到已经存在的分组内, 使得每个分组都满足(εi, k)- 匿名原则.

11)输出一个准标识符表格QIT和敏感属性表ST, 这样就得到了满足(εi, k)- 匿名原则的数据表格.

4 实验结果及分析

4.1 信息可用性分析

select count(*) from SAL

where A1∈ b1 and A2 ∈ b2 and …Aw–1 ∈bw–1 and Aw∈bw

 图 1 数据集大小对信息可用性的影响

 图 2 准标识符属性个数对信息可用性的影响

 图 3 ε(εi)值的变化对信息可用性的影响

4.2 算法执行效率分析

 图 4 准标识符属性个数对算法执行时间的影响

 图 5 数据集大小对算法执行时间的影响

5 结论

 [1] Sweeney L. k-anonymity: A model for protecting privacy . International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570. DOI:10.1142/S0218488502001648 [2] Pramanik I, Lau RYK, Zhang WP. K-anonymity through the enhanced clustering method. Proceedings of the 2016 IEEE 13th International Conference on e-Business Engineering. Macau, China, 2016, 85-91. [3] Gao YXN, Luo T, Li JF, et al. Research on K anonymity algorithm based on association analysis of data utility. Proceedings of the 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference. Chongqing, China, 2017, 426-432. [4] 吕品, 钟珞, 于文兵, 等. MA-Datafly: 一种支持多属性泛化的k-匿名方法 . 计算机工程与应用, 2013, 49(4): 138-140. DOI:10.3778/j.issn.1002-8331.1107-0183 [5] Machanavajjhala A, Kifer D, Gehrke J, et al. L-diversity: Privacy beyond k-anonymity . ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): 3. DOI:10.1145/1217299 [6] Xiao XK, Yi K, Tao YF. The hardness and approximation algorithms for l-diversity. Proceedings of the 13th International Conference on Extending Database Technology. Lausanne, Switzerland, 2010, 135-146. [7] Yang GM, Li JZ, Zhang SX, et al. An enhanced l-diversity privacy preservation. Proceedings of the 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery. Shenyang, China, 2013, 1115-1120. [8] Li NH, Li TC, Venkatasubramanian S. t-closeness: Privacy beyond k-anonymity and ℓ-diversity. Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering . Istanbul, Turkey, 2007, 106-115. [9] El Ouazzani Z, El Bakkali H. A new technique ensuring privacy in big data: Variable t-closeness for sensitive numerical attributes. Proceedings of the 2017 3rd International Conference of Cloud Computing Technologies and Applications. Rabat, Morocco, 2017, 1-6. [10] Li JX, Tao YF, Xiao XK. Preservation of proximity privacy in publishing numerical sensitive data. Proceedings of 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada, 2008, 473-486. [11] Zhang Q, Koudas N, Srivastava D, et al. Aggregate query answering on anonymized tables. Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering. Istanbul, Turkey, 2007, 116-125. [12] 韩建民, 于娟, 虞慧群, 等. 面向数值型敏感属性的分级l-多样性模型 . 计算机研究与发展, 2011, 48(1): 147-158. [13] Han JW, Kamber M, Pei J. 数据挖掘: 概念与技术. 范明, 孟小峰, 译. 北京: 机械工业出版社, 2012. 88–90. [14] 周志华. 机器学习. 北京: 清华大学出版社, 2016. 39. [15] Xiao XK, Tao YF. Anatomy: Simple and effective privacy preservation. Proceedings of the 32nd International Conference on Very Large Data Bases. Seoul, Korea, 2006.