Computer Systems Applications, 2019, 28(1): 176-181

Design and Improvement of Tag Deletion Function in Crawler

DENG Zi-Yun

(College of Economics and Trade, Changsha Commerce & Tourism College, Changsha 410116, China)

Received: July 16, 2018; Revised: August 10, 2018

Abstract: After a crawler has collected a large set of web pages from a large commodity website, the pages must be filtered further to obtain the target data set. One preparatory step before filtering is to delete the redundant tags in the web pages. To this end, a tag deletion algorithm based on recursion is given, and a software design for the tag deletion function is proposed. The design went through two rounds of improvement and performance optimization. The final version adopts a dual-thread design: one buffer maintenance thread and one tag deletion thread. Experiments show that, on a single machine, the optimized tag deletion function takes an average of only 19.7 s per 1000 web pages and only 1.1 hours for 200 000 web pages.

Keywords: tag deletion function; recursive algorithm; dual-thread design; performance experiment
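
The abstract names two techniques: a recursive tag deletion algorithm and a dual-thread design with one buffer maintenance thread and one tag deletion thread. The following is a minimal Python sketch of how the two could fit together; it is not the paper's implementation. The function names, the choice of `script` and `style` as the "redundant" tags, the buffer size, and the handling of unclosed tags are all assumptions made for illustration.

```python
import queue
import re
import threading

# Assumption: the abstract does not say which tags are redundant.
REDUNDANT_TAGS = ("script", "style")


def delete_tag(html: str, tag: str) -> str:
    """Recursively delete every <tag ...>...</tag> element from html.

    One recursion level is used per occurrence, so this sketch suits
    pages with fewer occurrences than Python's recursion limit. It does
    not handle elements of the same tag nested inside each other.
    """
    name = re.escape(tag)
    opening = re.search(rf"<{name}\b[^>]*>", html, re.IGNORECASE)
    if opening is None:
        return html  # base case: no opening tag left
    closing = re.compile(rf"</{name}\s*>", re.IGNORECASE).search(html, opening.end())
    # Assumption: an unclosed element swallows the rest of the page.
    rest = html[closing.end():] if closing else ""
    return html[:opening.start()] + delete_tag(rest, tag)


def strip_tags(html: str) -> str:
    """Delete all redundant elements from one page."""
    for tag in REDUNDANT_TAGS:
        html = delete_tag(html, tag)
    return html


_DONE = object()  # sentinel marking the end of the page stream


def maintain_buffer(pages, buffer):
    """Buffer maintenance thread: keep the bounded buffer supplied with pages."""
    for page in pages:
        buffer.put(page)  # blocks while the buffer is full
    buffer.put(_DONE)


def delete_tags_worker(buffer, results):
    """Tag deletion thread: take pages from the buffer and clean them."""
    while True:
        page = buffer.get()  # blocks while the buffer is empty
        if page is _DONE:
            break
        results.append(strip_tags(page))


def run(pages, buffer_size=100):
    """Run the two threads and return the cleaned pages in input order."""
    buffer = queue.Queue(maxsize=buffer_size)
    results = []
    threads = [
        threading.Thread(target=maintain_buffer, args=(pages, buffer)),
        threading.Thread(target=delete_tags_worker, args=(buffer, results)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run(["<p>a</p><script>x()</script><b>b</b>"])` returns `["<p>a</p><b>b</b>"]`. The bounded queue plays the role of the buffer: the maintenance thread stalls when it is full, and the deletion thread stalls when it is empty. As a consistency check on the reported figures, 200 000 pages at 19.7 s per 1000 pages is 200 × 19.7 s = 3940 s, or about 1.09 hours, which matches the stated 1.1 hours.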

Foundation item: Natural Science Foundation of Hunan Province (2017JJ5064)

Author Name	Affiliation	E-mail
DENG Zi-Yun	College of Economics and Trade, Changsha Commerce & Tourism College, Changsha 410116, China	dengziyun@126.com

Cite this article:
DENG Zi-Yun. Design and Improvement of Tag Deletion Function in Crawler. Computer Systems Applications, 2019, 28(1): 176-181