﻿ 基于聚类和奖惩用户模型的协同过滤算法
 计算机系统应用  2020, Vol. 29 Issue (8): 135-143 PDF

Collaborative Filtering Algorithm Based on Clustering and Incentive/Penalty User Model
WU Qing-Yang, CHENG Xu, DENG Cheng-Peng, DING Hao-Xuan, ZHANG Hong, LIN Sheng-Hai
Automotive Data of China (Tianjin) Co. Ltd, Tianjin 300393, China
Abstract: Giving or recommending appropriate content based on the quality of experience is the most important in recommender systems. This study proposes a new CBCF (Clustering-Based CF) method using an Incentivized/Penalized User (IPU) model, which is thus easy to implement. The purpose of this study is to improve recommendation performance of accuracy, recall and F1-score by studying the differences of users’ preferences through IPU model. This study formulates a constrained optimization problem in which we aim to maximize the recall (or equivalently F1-score) for a given precision. To this end, users are divided into several clusters based on the actual rating data and Pearson correlation coefficient. Afterward, we give each item an incentive/penalty according to the preference tendency by users within the same cluster. Experiments show that under the condition of given accuracy, the recall rate of the proposed algorithm can be improved by about 50%.
Key words: clustering     collaborative filtering     F1-score     incentivized/penalized user model     Pearson correlation coefficient     recommender system

1 相关研究

1.1 CF算法

CF是推荐系统最常用的技术之一, 但在数据稀疏和冷启动等方面存在不足[5]. 如果用户评分矩阵过于稀疏, 那么预测评分就会不准确. 此外, CF很难对新用户或项目进行预测评分. 解决这两个问题有很大的挑战[6]. 文献[4]在如何提高CF推荐系统的预测精度上进行了研究.

1.2 聚类算法

1.3 基于聚类的推荐系统

1.4 性能指标分析及偏好预测方法

1.5 聚类

FCM聚类[16]通过系数 $w_{ij}^m$ 将对象 ${x_i}$ 划分到簇 ${c_j}$ , 使每个对象成为具有不同模糊隶属度的所有聚类的成员, 其中m是控制集群模糊程度的超参数. m越大, 簇越模糊. FCM聚类首先在给定多个聚类的情况下随机初始化每个聚类的中心点. 然后, 重复以下两个步骤直到两次迭代之间系数的变化小于给定的阈值. (1)计算每个簇的质心; (2)计算每个点在簇中的系数.

2 基于聚类和奖惩用户模型的推荐算法 2.1 问题定义

 ${\displaystyle {\overline{ C}}_c^i} = \frac{{\displaystyle\sum\nolimits_{u \in U_{i,c}} {r_{u,i}} }}{{|U_{i,c}|}}$ (1)

 图 1 基于IPU模型的CBCF算法的例子, 假设有2个项目和4个用户簇, 实心方形项目和实心圆形项目分别表示测试数据和训练数据

1.　if $\scriptstyle{\overline{ C}} _c^i \ge \gamma$ then

2.　　if $\scriptstyle{\mathop r\limits^ {\wedge}} _{u,i} \ge \beta$ then

3.　　　将项目i推荐给用户u

4.　　else 项目i不被推荐

5.　else

6.　　if $\scriptstyle{\mathop r\limits^ {\wedge}} _{u,i} \ge \alpha$ then

7.　　　将项目i推荐给用户u

8.　　　else 项目i不被推荐

9.　end

 $\left\{ \begin{split} & f_{\rm TP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = I_{[\gamma ,\infty )}({\overline{C}} _c^{u,i}) \cdot I_{[\beta ,\infty )}({\hat{r}} _{u,i}) \cdot I_{[\delta _{\rm pref},\infty )}(r_{u,i}) + I_{(0,\gamma )}({\overline{C}} _c^{u,i}) \cdot I_{[\alpha ,\infty )}({\hat{r}} _{u,i}) \cdot I_{[\delta _{\rm pref},\infty )}(r_{u,i}) \\ & f_{\rm FP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = I_{[\gamma ,\infty )}({\overline {C}} _c^{u,i}) \cdot I_{[\beta ,\infty )}({\hat{r}} _{u,i}) \cdot I_{(0,\delta _{\rm pref})}(r_{u,i}) + I_{(0,\gamma )}{\overline{C}} _c^{u,i}) \cdot I_{[\alpha ,\infty )}({\hat{r}} _{u,i}) \cdot I_{(0,\delta _{\rm pref})}(r_{u,i}) \\ & f_{\rm FN}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = I_{[\gamma ,\infty )}({\overline{C}} _c^{u,i}) \cdot I_{(0,\beta )}({\hat{r}} _{u,i}) \cdot I_{[\delta _{\rm pref},\infty )}(r_{u,i}) + I_{(0,\gamma )}({\overline{C}} _c^{u,i}) \cdot I_{(0,\alpha )}({\hat{r}} _{u,i}) \cdot I_{[\delta _{pref},\infty )}(r_{u,i}) \\ & f_{\rm TN}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = I_{[\gamma ,\infty )}({\overline{C}} _c^{u,i}) \cdot I_{(0,\beta )}({\hat{r}} _{u,i}) \cdot I_{(0,\delta _{pref})}(r_{u,i})+ I_{(0,\gamma )}({\overline{C}} _c^{u,i}) \cdot I_{(0,\alpha )}({\hat{r}} _{u,i}) \cdot I_{(0,\delta _{pref})}(r_{u,i}) \end{split} \right.$ (2)

 $\left\{ \begin{split} & precision(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = \frac{{\displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm TP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} )}}{{\displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm TP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} ) + \displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm FP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} )}}\\ &recall(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = \frac{{\displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm TP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} )}}{{\displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm TP}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} ) + \displaystyle\sum\nolimits_{(u,i) \in T} {f_{\rm FN}^{u,i}(\alpha ,\beta ,\gamma ,\delta _{\rm pref}} )}} \end{split}\right.$ (3)

 $F_1(\alpha ,\beta ,\gamma ,\delta _{\rm pref}) = \frac{{2precision \times recall}}{{precision + recall}}$ (4)

(1) γ=0(不使用聚类的CF算法): 从表1可以看出, TP=2, FP=1, FN=3. 因此, 使用式(3)计算准确率为2/3, 召回率为2/5.

(2) γ=3.0(本文提出的算法): 假设 $\alpha$ =4.5且 $\beta$ =3.5. 从表1可以得出, TP=4, FP=1, FN=1, 使用式(3)计算准确率为4/5, 召回率为4/5.

2.2 公式化

 $\left\{ \begin{array}{l} \mathop {\rm maximize}\limits_{\alpha ,\beta ,\gamma } \;\;\;F_1(\alpha ,\beta ,\gamma ) \;\; {\rm{or }}\;\;recall(\alpha ,\beta ,\gamma )\\ {\text{当}} \;\;precision(\alpha ,\beta ,\gamma ) \ge \delta _{precision},\alpha \ge \beta \end{array}\right.$ (5)

2.3 基于IPU模型的CBCF算法

CBCF算法通过用户聚类以及使用IPU模型分析用户间的偏好倾向从而进行推荐. 使用IPU模型的CBCF算法的核心是根据 ${\overline {C}} _c^i$ (用户簇 $Cc$ 中对项目i的偏好的均值)的结果对每个项目进行激励或惩罚. 由于评分矩阵 $R_{\rm CBCF}$ 中存在用户未评分项目, 因此无法准确计算用户向量之间的欧几里得距离(即 $R_{\rm CBCF}$ 中的行向量). 因此, 本文使用皮尔逊相关系数(PCC). PCC通过计算两个用户的共同评分之间的相关性来衡量其相似性, 两个用户u1和u2之间的相似度s(u1, u2)为:

 $s(u1,u2) = \frac{{\displaystyle\sum\nolimits_{i \in I_{u1} \cap I_{u2}} {(r_{u1,i} - {\mathop r\limits^ -} _{ u1}) \cdot } (r_{u2,i} - {\mathop r\limits^ -} _{ u2})}}{{\sqrt {\displaystyle\sum\nolimits_{i \in I_{u1} \cap I_{u2}} {{{(r_{u1,i} - {\mathop r\limits^ -} _{ u1})}^2}} } \cdot \sqrt {\displaystyle\sum\nolimits_{i \in I_{u1} \cap I_{u2}} {{{(r_{u2,i} - {\mathop r\limits^ -} _{ u2})}^2}} } }}$ (6)

1. 用户簇 $\scriptstyle C \in \{ C_1, \cdots ,Cc\}$ ;

2. 初始化n, x, m的用户评分矩阵 $\scriptstyle R_{\rm CBCF}$ ;

3. $\scriptstyle{\hat{ R}}$ $\leftarrow$ $\scriptstyle R_{\rm CBCF}$ 预测评分;

4. 初始化阈值 $\scriptstyle\alpha ,\;\beta ,\;\gamma$ ;

5. for $\scriptstyle u \leftarrow 1$ to n do

6. $\scriptstyle I_u \leftarrow$ 用户测试集中缺少评分的项目;

7. $\scriptstyle u;$

8. $\scriptstyle{\hat{ r}} _{u,Iu} \leftarrow$ 项目 $\scriptstyle I_u$ 的预测评分;

9. for $\scriptstyle i \leftarrow 1$ to $\scriptstyle|I_u|$ do

10. $\scriptstyle C_{\rm tmp} \leftarrow$ 用户u所属的用户簇;

11. $\scriptstyle{\overline{ C}} _{\rm tmp}^i \leftarrow$ 用户簇中的用户对项目i的评分均值;

12. if $\scriptstyle{\hat{ r}} _{u,i} \ge \alpha$ then

13. 将项目i推荐给用户u;

14. else if $\scriptstyle{\hat{ r}} _{u,i} \ge \beta$ && $\scriptstyle{\overline{ C}} _{\rm tmp}^i \ge \gamma$ then

15. 将项目i推荐给用户u;

16. else 项目i不被推荐;

17. end

18.end

(1)通过使用IPU模型以及CBCF算法来决策是否将项目i推荐给活跃用户u.

(2)当(即 ${\hat{ r}} _{u,i} \ge \alpha$ ), 那么将项目i推荐给用户u.

(3)当 ${\hat{ r}} _{u,i} \ge \beta$ ${\overline{ C}} _c^i \ge \gamma$ , 向用户u推荐项目i, ${\overline{ C}} _c^i$ 是用户簇 $C_c$ 对项目i的偏好均值.

3 实验 3.1 实验数据集

100 KB的数据集具有以下属性:

(1)评分最高为5星;

(2)每个用户至少有20条评分记录;

(3) 100 KB数据集有100 000条评分记录;

(4) 100 KB数据集有943个用户和1682部电影.

 $\left\{\begin{array}{l} U\mathop = \limits^{\Delta} \{ u_1,u_2,\cdots,u_n\} \\ I\mathop = \limits^{\Delta} \{ i_1,i_2,\cdots,i_m\} \\ \end{array} \right.$ (7)

 $R_{\rm CBCF} = \left( {\begin{array}{*{20}{c}} {r_{1,1}}&{r_{1,2}}&{r_{1,3}}& \cdots &{r_{1,m}} \\ {r_{2,1}}&{r_{2,2}}&{r_{2,3}}& \cdots &{r_{2,m}} \\ {r_{3,1}}&{r_{3,2}}&{r_{3,3}}& \cdots &{r_{3,m}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {r_{n,1}}&{r_{n,2}}&{r_{n,3}}& \cdots &{r_{n,m}} \end{array}} \right)$ (8)

 $U_b = [r_{b,1},r_{b,2}, \cdots, r_{b,m}]$ (9)

 $C = \{ C_1,C_2, \cdots ,C_c\}$ (10)

3.2 实验结果与分析

(1)实际评分为4.0或5.0的项目向用户推荐.

(2)实际评分低于4.0的项目不向用户进行推荐.

 $\left\{ \begin{array}{l} s(u_1,u_2) \leftarrow 1 - s(u_1,u_2)\; {\rm{for}} \;s(u_1,u_2) \in [0,1]\\ s(u_1,u_2) \leftarrow - (s(u_1,u_2) - 1)\;{\rm{for }}\;s(u_1,u_2) \in [ - 1,0] \end{array}\right.$ (11)
 图 2 簇间和簇内的欧几里德距离比较结果

 图 3 当阈值γ为3.4, α和β对F1-score的影响

 图 4 基于项目的CF, F1-score随着推荐阈值的变化趋势

4 结束语

 [1] 姜书浩, 张立毅, 张志鑫. 基于个性化的多样性优化推荐算法. 天津大学学报(自然科学与工程技术版), 2018, 51(10): 1042-1049. [2] Su XY, Khoshgoftaar TM. A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009, 421425. [3] 朱叶. 面向海量数据的推荐系统的研究[硕士学位论文]. 北京: 北京理工大学, 2015. [4] Cai Y, Leung HF, Li Q, et al. Typicality-based collaborative filtering recommendation. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(3): 766-779. DOI:10.1109/TKDE.2013.7 [5] 翁小兰, 王志坚. 协同过滤推荐算法研究进展. 计算机工程与应用, 2018, 54(1): 25-31. [6] Sobhanam H, Mariappan AK. A hybrid approach to solve cold start problem in recommender systems using association rules and clustering technique. International Journal of Computer Applications, 2013, 74(4): 17-23. DOI:10.5120/12873-9697 [7] Andrat H, Ansari N. Analyzing game stickiness using clustering techniques. In: Bhatia SK, Mishra KK, Tiwari S, et al, eds. Advances in Computer and Computational Sciences. Singapore: Springer, 2018. 645–654. [8] Chowdhury K, Chaudhuri D, Pal AK. A novel objective function based clustering with optimal number of clusters. In: Mandal JK, Mukhopadhyay S, Dutta P, et al, eds. Methodologies and Application Issues of Contemporary Computing Framework. Singapore: Springer, 2018. 23–32. [9] Tehreem A, Khawaja SG, Khan AM, et al. Multiprocessor architecture for real-time applications using mean shift clustering. Journal of Real-Time Image Processing, 2019, 16(6): 2233-2246. DOI:10.1007/s11554-017-0733-0 [10] Huang CL, Yeh PH, Lin CW, et al. Utilizing user tag-based interests in recommender systems for social resource sharing websites. Knowledge-Based Systems, 2014, 56: 86-96. DOI:10.1016/j.knosys.2013.11.001 [11] Yin B, Yang YJ, Liu WH. Exploring social activeness and dynamic interest in community-based recommender system. Proceedings of the 23rd International Conference on World Wide Web. Seoul, Republic of Korea. 2014. 771–776. [12] Koohi H, Kiani K. User based collaborative filtering using fuzzy C-means. Measurement, 2016, 91: 134-139. DOI:10.1016/j.measurement.2016.05.058 [13] Wu Y, DuBois C, Zheng AX, et al. Collaborative denoising auto-encoders for top-N recommender systems. Proceedings of the 9th ACM International Conference on Web Search and Data Mining. New York, NY, USA. 2016. 153–162. [14] Yang Z, Wu B, Zheng K, et al. A survey of collaborative filtering-based recommender systems for mobile Internet applications. IEEE Access, 2016, 4: 3273-3287. DOI:10.1109/ACCESS.2016.2573314 [15] von Luxburg U. A tutorial on spectral clustering. Statistics and Computing, 2007, 17(4): 395-416. DOI:10.1007/s11222-007-9033-z [16] 闫岩. 一种基于情境聚类的协同过滤算法的研究与实现. [硕士学位论文]. 长沙: 湖南大学, 2016. [17] Xu T, Tian J, Murata T. Research on personalized recommendation in E-commerce service based on data mining. Proceedings of International MultiConference of Engineers and Computer Scientists. Hong Kong, China. 2013. 313–317. [18] Hwang WS, Parc J, Kim SW, et al. “Told you I didn’t like it”: Exploiting uninteresting items for effective collaborative filtering. Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering. Helsinki, Finland. 2016. 349–360. [19] Pal NR, Bezdek JC. On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems, 1995, 3(3): 370-379. DOI:10.1109/91.413225 [20] Jazayeriy H, Mohammadi S, Shamshirband S. A fast recommender system for cold user using categorized items. Mathematical and Computational Applications, 2018, 23(1): 1. DOI:10.3390/mca23010001