Computer Systems & Applications, 2018, Vol. 27 Issue (8): 164–169

Website Authority Prediction Based on Deep Learning
YANG Hai-Hua1,2, FENG Yang-De1, WANG Jue1, NIE Ning-Ming1, LIU Fang1, ZHANG Bo-Yao1
1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China
Foundation item: National Key R & D Program of China (2017YFB0203704)
Abstract: Website authority is generally measured by external links: the more high-quality external links a website or web page has, the more authoritative it is considered to be. Algorithms that evaluate website authority in this way include PageRank. However, such algorithms weigh external links selectively, so this approach has certain drawbacks. This study instead uses deep learning: search terms and URLs are mapped into vectors, and the similarity between the two vectors is computed to judge the authority of different websites under a given search term. A website with a high computed similarity is regarded as an authoritative site for that search term, giving an alternative way to measure website authority. Two models, one based on Word2vec and one on LSTM, are compared experimentally; the results on open datasets show that both models are effective and that the LSTM model outperforms the Word2vec model.
Key words: website authority; Word2vec; LSTM; Natural Language Processing (NLP)

1 Methods and Models

1.2 Training Word Vectors with the Word2vec Model

The Word2vec model is a tool developed by Mikolov et al. on the basis of the NNLM[6] and Log-Bilinear[7] models. It comes in two variants: Continuous Bag Of Words (CBOW) and continuous Skip-gram. The CBOW model uses several surrounding context words to predict the current word; the Skip-gram model does the opposite, using the current word to predict several context words.
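The difference between the two variants comes down to how training pairs are built from a window around each position. The following sketch (not from the paper; the window size and sentence are illustrative) generates the training pairs each variant would see:

```python
def cbow_pairs(tokens, window=2):
    """CBOW: (context words -> target word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: (current word -> one context word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
            pairs.append((center, ctx))
    return pairs

sentence = ["the", "quick", "brown", "fox", "jumps"]
print(cbow_pairs(sentence)[2])      # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs(sentence)[0])  # ('the', 'quick')
```

CBOW predicts one word from many (faster, smoother over frequent words), while Skip-gram predicts many words from one (better for rare words), which is why the two are usually offered as alternatives in the same tool.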

1.3 The LSTM Network Model

The LSTM network is an extension of the RNN. It successfully overcomes the shortcomings of the original recurrent neural network and has become the most popular RNN variant. Unlike a traditional RNN, whose hidden layer has only a single state h and is therefore very sensitive to short-term input, the basic LSTM module has a different structure: it adds a new cell state C, as shown in Fig. 1.

Figure 1. Schematic of the LSTM network module

The LSTM has three gate structures, the forget gate, the input gate, and the output gate, which maintain and update the cell state. The three gates act as follows:

(1) Forget gate: it decides how much of the cell state at the previous time step, c_{t-1}, is retained in the current state c_t.

 ${f_t} = \sigma \left( {{w_f} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {b_f}} \right)$ (1)

(2) Input gate: it decides how much of the network's current input x_t is stored in the cell state c_t.

 ${i_t} = \sigma \left( {{w_i} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {b_i}} \right)$ (2)

(3) Output gate: it controls how much of the cell state c_t flows into the LSTM's current output h_t. Equations (3) and (4) give the candidate state and the cell-state update; Eqs. (5) and (6) give the output gate and the final output.

 ${\tilde c_t} = \tanh \left( {{w_c} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {b_c}} \right)$ (3)
 ${c_t} = {f_t} \circ {c_{t - 1}} + {i_t} \circ {\tilde c_t}$ (4)
 ${o_t} = \sigma \left( {{w_o} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {b_o}} \right)$ (5)
 ${h_t} = {o_t} \circ \tanh \left( {{c_t}} \right)$ (6)
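Equations (1)–(6) can be checked with a minimal NumPy sketch of a single LSTM step. This is not the paper's implementation; the dimensions and random weights below are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(6).
    Each W[k] acts on the concatenation [h_{t-1}, x_t]; b[k] is its bias."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (2)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde       # cell-state update, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden output, Eq. (6)
    return h_t, c_t

# Toy dimensions: input size 4, hidden size 3 (illustrative values only).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Note that "∘" in Eqs. (4) and (6) is the elementwise (Hadamard) product, which is why the sketch uses `*` rather than matrix multiplication there.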

The LSTM network model has been successfully applied to natural language processing tasks such as image/video captioning[12–14], text/sentiment classification[15–18], machine translation[19], and question answering[20,21]. Because the LSTM learns, through its memory unit, what to forget from the cell state and how to update it, it can capture long-range dependencies in text sequences; it is therefore natural to use an LSTM network to learn the vector representations of the search terms needed in this paper.

2 Model Design

2.1 Experimental Design Using the Word2vec Model

1) Segment all search terms into words, then train the Word2vec model on the segmented terms to obtain the final word-vector table;

2) For a newly arriving search term, segment it into words, look each word up in the word-vector table produced by Word2vec, and sum the corresponding word vectors to form the vector representation of the search term, denoted Query_EM. For example, the search term "我是中国人" ("I am Chinese") is segmented into the three words "我", "是", and "中国人"; the three corresponding word vectors are retrieved from the table and summed to give the vector representation of "我是中国人";

3) Use the tf.nn.embedding_lookup function to look up a randomly initialized URL table and obtain the vector representations of URL1 and URL2, denoted URL_BEM and URL_WEM respectively, and mark them as trainable;

4) Apply the Sigmoid function to the inner product of the search-term vector and the vector of the URL with the high click-through rate; the result is denoted Q_BSCORE. Likewise, apply the Sigmoid function to the inner product of the search-term vector and the vector of the URL with the low click-through rate; the result is denoted Q_WSCORE. Q_BSCORE and Q_WSCORE are computed as follows:

 ${Q_{BSCORE}} = f\{ \sum\limits_{i = 1}^n {(Quer{y_{EM}} \cdot UR{L_{BEM}})} \}$ (7)
 ${Q_{WSCORE}} = f\{ \sum\limits_{i = 1}^n {(Quer{y_{EM}} \cdot UR{L_{WEM}})} \}$ (8)

5) Define the loss function, denoted Loss, and optimize the model with it; Eq. (9) gives its computation:

 $Loss = \sum\nolimits_{i = 1}^n {\max \left( {Q_{WSCORE}} - {Q_{BSCORE}}, 0 \right)}$ (9)

6) During training, samples for which the difference Q_BSCORE − Q_WSCORE is greater than 0 are positive examples, and those for which it is less than 0 are negative examples. The average accuracy per batch is computed as the number of positive examples in the current batch divided by the total number of samples in the batch.
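Steps 2)–6) above can be sketched in NumPy for a toy batch. All embeddings here are random stand-ins (the paper trains them with TensorFlow); the loss is a pairwise hinge-style term that penalizes cases where the low-click URL scores at least as high as the high-click URL, consistent with the accuracy definition in step 6:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy batch: hypothetical 8-dim embeddings for a query and two URLs per sample.
rng = np.random.default_rng(1)
batch, dim = 5, 8
query_em = rng.standard_normal((batch, dim))  # Query_EM: summed word vectors
url_bem = rng.standard_normal((batch, dim))   # URL_BEM: high click-through URL
url_wem = rng.standard_normal((batch, dim))   # URL_WEM: low click-through URL

# Eqs. (7)-(8): sigmoid of the inner product of query and URL embeddings.
q_bscore = sigmoid(np.sum(query_em * url_bem, axis=1))  # Q_BSCORE
q_wscore = sigmoid(np.sum(query_em * url_wem, axis=1))  # Q_WSCORE

# Hinge-style loss: correctly ranked pairs (Q_BSCORE > Q_WSCORE) contribute 0.
loss = np.sum(np.maximum(q_wscore - q_bscore, 0.0))

# Step 6: batch accuracy = fraction of samples with Q_BSCORE - Q_WSCORE > 0.
accuracy = np.mean(q_bscore - q_wscore > 0)
print(loss, accuracy)
```

In the actual model the URL table entries are trainable variables, so minimizing this loss pushes the high-click URL's embedding toward the query embedding and the low-click URL's away from it.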

Figure 2. Training flowchart for the Word2vec-model experiment

2.2 Experimental Design Using the LSTM Model

1) Use the tf.nn.embedding_lookup function to look up a randomly initialized vocabulary table and obtain the vector representations of the words produced by segmenting the search term; the result, denoted QueryEm_init, serves as the input to the LSTM and is marked as trainable;

2) Take the hidden-layer output of the LSTM model at the final time step, i.e., the 20th time step, as the vector representation of the whole search term, denoted Q_EM.
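The two steps above amount to unrolling the LSTM over a fixed-length (20-token) query and keeping the last hidden state. A NumPy sketch follows; the vocabulary size, embedding dimension, and hidden size are assumptions for illustration, and the random embedding table stands in for the trainable QueryEm_init lookup:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h, c, W, b):
    """One LSTM step over the concatenation [h, x_t] (Eqs. (1)-(6))."""
    z = np.concatenate([h, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate
    i = sigmoid(W["i"] @ z + b["i"])   # input gate
    g = np.tanh(W["c"] @ z + b["c"])   # candidate state
    c = f * c + i * g                  # cell-state update
    o = sigmoid(W["o"] @ z + b["o"])   # output gate
    return o * np.tanh(c), c

# Hypothetical sizes: queries padded/truncated to 20 tokens, 16-dim word
# embeddings, 32-dim hidden state, 1000-word vocabulary.
rng = np.random.default_rng(2)
seq_len, emb_dim, hid_dim, vocab = 20, 16, 32, 1000
embed_table = rng.standard_normal((vocab, emb_dim)) * 0.1
W = {k: rng.standard_normal((hid_dim, hid_dim + emb_dim)) * 0.1 for k in "fico"}
b = {k: np.zeros(hid_dim) for k in "fico"}

token_ids = rng.integers(0, vocab, size=seq_len)  # a padded query
h, c = np.zeros(hid_dim), np.zeros(hid_dim)
for t in token_ids:                  # unroll the LSTM over the 20 time steps
    h, c = lstm_step(embed_table[t], h, c, W, b)
q_em = h                             # Q_EM: last hidden state = query vector
print(q_em.shape)  # (32,)
```

Once Q_EM replaces the summed-word-vector Query_EM, the scoring, loss, and accuracy computation proceed exactly as in the Word2vec experiment.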

Figure 3. Training flowchart for the LSTM-model experiment

3 Experimental Results and Analysis

3.1 Dataset, Data Format, and Evaluation Metrics

3.2 Experimental Results

Figure 4. Word2vec-model experiment

Figure 5. LSTM-model experiment

(1) Because the former first obtains the final word-vector table by training the Word2vec model, only the URL table needs to be updated during subsequent training, so it converges faster than the latter.

(2) The words of a user query carry sequential semantic relations, and the inherently temporal nature of the LSTM model captures these relations well, whereas Word2vec is concerned mainly with word-to-word similarity in the embedding space. The search-term vector representation obtained with the LSTM model therefore fits the application scenario of this paper better than the one obtained with Word2vec, which is why its average accuracy is higher than in the Word2vec experiment.

4 Conclusion

[1] Page L, Brin S, Winograd T. The PageRank citation ranking: Bringing order to the web. Stanford: Stanford InfoLab, 1998. 1–14.
[2] Yu K, Jia L, Chen YQ, et al. Deep learning: Yesterday, today, and tomorrow. Journal of Computer Research and Development, 2013, 50(9): 1799–1804. DOI:10.7544/issn1000-1239.2013.20131180 (in Chinese)
[3] Sun ZJ, Xue L, Xu YM, et al. A survey of research on deep learning. Application Research of Computers, 2012, 29(8): 2806–2810. DOI:10.3969/j.issn.1001-3695.2012.08.002 (in Chinese)
[4] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, NV, USA. 2013. 3111–3119.
[5] Graves A, Mohamed AR, Hinton G. Speech recognition with deep recurrent neural networks. Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada. 2013. 6645–6649.
[6] Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model. Journal of Machine Learning Research, 2003, 3: 1137–1155.
[7] Mnih A, Hinton G. Three new graphical models for statistical language modelling. Proceedings of the 24th International Conference on Machine Learning. Corvalis, OR, USA. 2007. 641–648.
[8] Wang QQ, Zhang YH, Li PP, et al. A cross-domain sentiment classification method based on word2vec. Application Research of Computers, 2018, 35(10). [Online] http://www.arocmag.com/article/02-2018-10-004.html (in Chinese)
[9] Xue B, Fu C, Zhan SB. A study on sentiment computing and classification of Sina Weibo with Word2vec. Proceedings of 2014 IEEE International Congress on Big Data. Anchorage, AK, USA. 2014. 358–363.
[10] Sharma K, Kumar AC, Bhandarkar SM. Action recognition in still images using word embeddings from natural language descriptions. Proceedings of 2017 IEEE Winter Applications of Computer Vision Workshops. Santa Rosa, CA, USA. 2017. 58–66.
[11] Lilleberg J, Zhu Y, Zhang YQ. Support vector machines and word2vec for text classification with semantic features. Proceedings of the 14th International Conference on Cognitive Informatics & Cognitive Computing. Beijing, China. 2015. 136–140.
[12] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to sequence – video to text. Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile. 2015. 4534–4542.
[13] Byeon W, Breuel TM, Raue F, et al. Scene labeling with LSTM recurrent neural networks. Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 3547–3555.
[14] Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator. Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 3156–3164.
[15] Ma SL. Application of a deep-learning-based text detection algorithm in bank operation and maintenance. Computer Systems & Applications, 2017, 26(2): 184–188. DOI:10.15888/j.cnki.csa.005628 (in Chinese)
[16] Zhao Z, Chen WH, Wu XM, et al. LSTM network: A deep learning approach for short-term traffic forecast. IET Intelligent Transport Systems, 2017, 11(2): 68–75. DOI:10.1049/iet-its.2016.0208
[17] Liu PF, Qiu XP, Chen XC, et al. Multi-timescale long short-term memory neural network for modelling sentences and documents. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal. 2015. 2326–2335.
[18] Wang X, Liu YC, Sun CJ, et al. Predicting polarities of tweets by composing word embeddings with long short-term memory. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China. 2015. 1343–1353.
[19] Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems. Montreal, QC, Canada. 2014. 3104–3112.
[20] Ghosh S, Vinyals O, Strope B, et al. Contextual LSTM (CLSTM) models for large scale NLP tasks. arXiv: 1602.06291, 2016.
[21] Wang D, Nyberg E. A long short-term memory model for answer sentence selection in question answering. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China. 2015. 707–712.