Computer Systems & Applications, 2012, 21(3): 111-115, 124

Rankings Filtering Algorithm of Massive Data Based on Hadoop and its Application

HUANG De-Cai (黄德才), CHEN Huan (陈欢)

(College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China)

Received: July 6, 2011; Revised: August 24, 2011

Abstract: Rankings have become a widely followed part of modern life. Ranking massive data, however, still consumes substantial hardware resources and a long time even in a distributed environment, and sometimes the ranking cannot be produced at all. This paper improves the Bayesian method and proposes a rankings filtering algorithm for massive data in the Hadoop distributed environment. The method first fills in missing data using entropy theory to obtain a complete data set. It then uses the improved Bayesian method to compute the probability that an item's sales volume for the day will enter the ranking. Items whose probability falls below a probability threshold are filtered out and do not take part in the ranking computation, which greatly shortens the time needed to produce the ranking while preserving its accuracy. Simulation experiments on four million sales records from Taobao verify the effectiveness and performance of the proposed method.

Keywords: rankings; Hadoop; massive data; entropy; Bayes
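
The filter-then-rank idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the improved Bayesian method or the entropy-based filling step, so `entry_probability` below is a hypothetical stand-in (a Laplace-smoothed frequency of past ranking appearances), and the names `filter_and_rank`, `history`, and `threshold` are assumptions made for illustration. The sketch runs on a single machine and does not use Hadoop.

```python
from typing import Dict, List, Tuple


def entry_probability(past_hits: int, past_days: int) -> float:
    """Estimate the chance that an item enters today's ranking.

    Hypothetical stand-in: a Laplace-smoothed frequency of past ranking
    appearances, NOT the improved Bayesian method from the paper.
    """
    return (past_hits + 1) / (past_days + 2)


def filter_and_rank(
    sales: Dict[str, int],
    history: Dict[str, Tuple[int, int]],
    threshold: float,
    top_n: int,
) -> List[Tuple[str, int]]:
    """Drop items unlikely to enter the ranking, then rank the remainder.

    sales:   item id -> today's sales volume
    history: item id -> (days the item made the ranking, days observed)
    """
    # Filtering step: only items at or above the threshold are kept.
    candidates = {
        item: volume
        for item, volume in sales.items()
        if entry_probability(*history.get(item, (0, 0))) >= threshold
    }
    # Ranking step: sort by volume (descending), break ties by item id.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    sales = {"a": 100, "b": 90, "c": 5, "d": 95}
    history = {"a": (9, 10), "b": (8, 10), "c": (0, 10), "d": (0, 10)}
    print(filter_and_rank(sales, history, threshold=0.2, top_n=2))
```

In this toy input, item `d` is filtered out despite its high sales volume because its estimated probability is below the threshold. This shows why the choice of threshold and the quality of the probability estimate determine how much ranking accuracy is preserved.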


Foundation item: Major Science and Technology Program of Zhejiang Province (2009C11024)

Author Name	Affiliation
HUANG De-Cai	College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China
CHEN Huan	College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China

Cite this article:
HUANG De-Cai, CHEN Huan. Rankings Filtering Algorithm of Massive Data Based on Hadoop and its Application. Computer Systems & Applications, 2012, 21(3): 111-115, 124