﻿ 协同过滤算法中冷启动问题研究
 计算机系统应用  2019, Vol. 28 Issue (2): 246-252 PDF

Research on Cold-Start Problem of Collaborative Filtering Algorithm
SHAO Yu, XIE Ying-Hua
School of Information Science and Technology, Donghua University, Shanghai 201620, China
Abstract: In order to solve the cold-start problem of the traditional collaborative filtering algorithm and to improve the performance of recommendation, this study focuses on the cold-start problem and proposes two algorithms. Cold-start problem of new users: user-based collaborative filtering algorithm integrated with user’s information model, cold-start problem of new items: item-based collaborative filtering algorithm applying hierarchical clustering. After a series of experiments carried out on public data sets—MovieLens, comparing the difference between the precision and recall value of the improved algorithm and the traditional one, the results show that the new algorithm can effectively alleviate the cold start problem and improve the quality of recommendation.
Key words: collaborative filtering     cold-start     user characteristic     hierarchical clustering     similarities

1 传统的协同过滤算法

1.1 冷启动问题

1.2 冷启动问题的研究现状

Xu JW, Yao Y等提出了一种新型的评分比较策略(RaPare)[9]学习冷启动用户/项目的潜在性能, 通过寻找冷启动用户/项目和现有用户/项目之间的差异, 为冷启动用户/项目的潜在性能提供细粒度校准. Nguyen VD, Sriboonchitta S, Huynh VN引入了一种用户社交网络和软评分相结合的协同过滤推荐系统[10], 通过用户社交网络提取社区特征来解决冷启动问题. Katarya R, Verma OP将计算相似性的不对称方法与矩阵分解和基于典型性的协同过滤(Tyco)相结合[11], 实现了一种改进的电影推荐算法. 改进算法采用了Pearson相关系数计算相似度, 用线性回归进行预测, 得到更好的推荐结果.

2 相似度的计算方法 2.1 传统的相似度算法

 $sim(u,j) = \frac{{\displaystyle\sum\limits_{i = 1}^n {{u_i} \cdot {j_i}} }}{{\sqrt {\displaystyle\sum\limits_{i = 1}^n {{{\left( {{u_i}} \right)}^2}} } \cdot \sqrt {\displaystyle\sum\limits_{i = 1}^n {{{({j_i})}^2}} } }}$ (1)

 ${{d}}\left( {u,j} \right) = \sqrt {\sum {{{\left( {{u_i} - {j_i}} \right)}^2}} }$ (2)
 ${{sim}}\left( {u,j} \right) = \frac{1}{{1 + d\left( {u,j} \right)}}$ (3)

Pearson相关系数:

 $sim\left( {u,j} \right) = \frac{{\displaystyle\sum\limits_{i \in {I_{u,j}}}^n {\left( {{R_{u,i}} - \overline {{R_u}} } \right)} \left( {{R_{j,i}} - \overline {{R_j}} } \right)}}{{\sqrt {\displaystyle\sum\limits_{i \in {I_{u,j}}}^n {{{\left( {{R_{u,i}} - \overline {{R_u}} } \right)}^2}} } \sqrt {\displaystyle\sum\limits_{i \in {I_{u,j}}}^n {{{\left( {{R_{j,i}} - \overline {{R_j}} } \right)}^2}} } }}$ (4)

2.2 改进后的算法 2.2.1 融合用户信息模型的基于用户的协同过滤算法

 $attr(u,v) = \sum\limits_{i = 1}^k {{\lambda _i} \cdot att{r_i}}$ (5)

 $\sum\limits_{i = 1}^k {{\lambda _{{i}}}} = 1$ (6)

 $s(x) = \frac{1}{{1 + {e^{ - x}}}}$ (7)
 $si{m_{attr}}\left( {u,v} \right) = 2 \times (1 - \frac{1}{{1 + \exp ( - attr(u,v))}})$ (8)

 ${p_{i,j}} = \overline {{p_i}} + \frac{{\sum\nolimits_{u \in C} {sim({u_i},{u_j}) \times ({R_{u,i}} - \overline {{R_u}} )} }}{{\sum\nolimits_{u \in C} {\left| {sim({u_i},{u_j})} \right|} }}$ (9)

 图 1 新用户冷启动算法

2.2.2 采用层次聚类的基于项目的协同过滤算法

(1) 数据初始化处理

(2) 计算欧式距离

 ${{d}}\left( {{{i}},j} \right) = \sqrt {\sum\limits_{k = 1}^n {{{({{{i}}_k} - {j_k})}^2}} }$ (10)

(3) 凝聚式层次聚类

 $\left\{\begin{array}{l} AVG.dis\tan ce(\{ b1,b5\} ,b2) = 3.525\\ AVG.dis\tan ce(\{ b1,b5\} ,b3) = 2.085\\ AVG.dis\tan ce(\{ b1,b5\} ,b4) = 3.46 \end{array}\right.$

 $\left\{\begin{array}{l} AVG.distance(\{ b2,b4\} ,b3) = 3.28\\ AVG.distance(\{ b1,b5\} ,\{ b2,b4\} ) = 3.4925 \end{array}\right.$

 图 2 书籍聚类结果树图示

 ${p_{ij}} = \overline {{p_j}} + \frac{{\displaystyle\sum\nolimits_{r \in S} {sim(j,r) \times \left( {{R_{i,r}} - \overline {{R_r}} } \right)} }}{{\displaystyle\sum\nolimits_{r \in S} {\left| {sim(j,r)} \right|} }}$ (11)

3 实验结果与分析 3.1 实验数据集

3.2 实验评价标准

 $Recall = \frac{{\displaystyle\sum\limits_{u \in U} {\left| {{R_{\left( u \right)}} \cap {T_{\left( u \right)}}} \right|} }}{{\displaystyle\sum\limits_{u \in U} {\left| {{T_{\left( u \right)}}} \right|} }}$ (12)
 $Precision = \frac{{\displaystyle\sum\limits_{u \in U} {\left| {{R_{\left( u \right)}} \cap {T_{\left( u \right)}}} \right|} }}{{\displaystyle\sum\limits_{u \in U} {\left| {{R_{\left( u \right)}}} \right|} }}$ (13)

 图 3 新项目冷启动算法

3.3 实验结果验证 3.3.1 新用户冷启动算法验证

MovieLens的用户集中包含了性别、年龄、职业三项用户属性, 本文分析这三项特征信息, 计算出用户之间的特征差 $attr(u,v)$ 如式(16).

 $attr(u,v) = \alpha \cdot sex + \beta \cdot age + \gamma \cdot occupation$ (14)

(1) 性别判定

MovieLens数据集中男性用户性别表示为M, 女性用户性别表示为F. 若两用户性别相同, sex取值为0, 若不同, sex取值为1.

(2) 年龄量化(表4)

(3) 职业量化(表5)

 ${w_{uv}} = \frac{{\displaystyle\sum\nolimits_{i \in {N_{(u)}} \cap {N_{(v)}}} {\frac{1}{{\log (1 + \left| {N(i)} \right|)}}} }}{{\sqrt {\left| {N(u)} \right|\left| {N(v)} \right|} }}$ (15)

 图 4 改进前后算法查全率随K值的变化

 图 5 改进前后算法查准率随K值的变化

3.3.2 新项目冷启动算法验证

Step 1. 对movie.data的数据初始化处理.

(1) 年份(year)关键词: 直接表示为iyjy.

(2) 派别(genres)关键词: 遍历电影所属的派别, 若两部电影有属于一个相同的派别, 则g值减1, 否则, g值保持不变(g初始值为3), 最后得到目标电影和其他电影派别上的距离g值.

Step 2. 欧式距离的计算.

Step 3. 层次聚类.

 图 6 改进前后ItemCF算法查全率随K值的变化

 图 7 改进前后ItemCF算法查准率随K值的变化

 ${w_{ij}} = \frac{{\left| {N(i) \cap N(j)} \right|}}{{\sqrt {\left| {N(i)} \right|\left| {N(j)} \right|} }}$ (16)

4 结论

(1) 针对新用户的冷启动, 算法提取了用户个人特征信息, 为用户信息建模, 调用Sigmoid函数求得基于用户特征模型的相似度.

(2) 对于新项目的冷启动, 算法提取项目的信息属性, 计算出欧式距离, 采用凝聚式层次聚类的方法, 找到目标项目的邻居项目, 计算预测评分, 完成推荐.

(3) 选用网络开源数据集MovieLens进行实验验证, 将新算法与传统算法多次实验对比. 结果表明, 在新用户和新项目的情况下, 新算法推荐结果的查全率、查准率都有所提升, 有效地缓解了传统协同过滤算法的冷启动问题, 改善了推荐质量.

 [1] Maes P. Agents that reduce work and information overload. Communications of the ACM, 1994, 37(7): 30-40. DOI:10.1145/176789.176792 [2] Guo YY, Liu QC. E-commerce personalized recommendation system based on multi-agent. Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery. Yantai, China. 2010. 1999–2003. [3] Bobadilla J, Ortega F, Hernando A. A collaborative filtering similarity measure based on singularities. Information Processing & Management, 2012, 48(2): 204-217. [4] Hu R, Pu P. Enhancing collaborative filtering systems with personality information. Proceedings of the 5th ACM Conference on Recommender Systems. Chicago, IL, USA. 2011. 197–204. [5] Wang JW. A collaborative filtering systems based on personality information. Proceedings of 2015 International Industrial Informatics and Computer Engineering Conference. Xi’an, China. 2015. [6] Goldberg D, Nichols D, Oki BM, et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 1992, 35(12): 61-70. DOI:10.1145/138859.138867 [7] 申辉繁. 协同过滤算法中冷启动问题的研究[硕士学位论文]. 重庆: 重庆大学, 2015. [8] 马卓. 协同过滤推荐算法的研究与改进[硕士学位论文]. 秦皇岛: 燕山大学, 2015. [9] Xu JW, Yao Y, Tong HH, et al. RaPare: A generic strategy for cold-start rating prediction problem. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(6): 1296-1309. DOI:10.1109/TKDE.2016.2615039 [10] Nguyen VD, Sriboonchitta S, Huynh VN. Using community preference for overcoming sparsity and cold-start problems in collaborative filtering system offering soft ratings. Electronic Commerce Research and Applications, 2017, 26: 101-108. DOI:10.1016/j.elerap.2017.10.002 [11] Katarya R, Verma OP. Effective collaborative movie recommender system using asymmetric user similarity and matrix factorization. Proceedings of 2016 International Conference on Computing, Communication and Automation. Noida, India. 2016. 71–75. [12] 乔雨, 李玲娟. 推荐系统冷启动问题解决策略研究. 计算机技术与发展, 2018, 28(2): 83-87. DOI:10.3969/j.issn.1673-629X.2018.02.019 [13] http://www.grouplens.org/node/73. [14] 程淑玉. 基于协同过滤算法的个性化推荐系统的研究[硕士学位论文]. 合肥: 合肥工业大学, 2010. [15] Breese JS, Heckerman D, Kadie C. Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. Madison, WI, USA. 1998. 43–52.