COMPUTER SYSTEMS APPLICATIONS (English edition): 2019, 28(1): 176-181
Design and Improvement of Tag Deletion Function in Crawler
(College of Economics and Trade, Changsha Commerce & Tourism College, Changsha 410116, China)
Received: July 16, 2018    Revised: August 10, 2018
Abstract: After a crawler has collected a large set of web pages from a large commodity website, the page set must be screened further to obtain the target data set. One preparatory step before screening is to delete the redundant tags in the web pages. To this end, a tag deletion algorithm based on recursion is given, and a software design for the tag deletion function is proposed. The design went through two rounds of improvement and performance optimization, and the final version adopts a dual-thread design: one buffer maintenance thread and one tag deletion thread. Experiments show that, on a single machine, the optimized tag deletion function takes an average of only 19.7 s per 1000 web pages and only 1.1 hours for 200 000 web pages.
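The abstract names two ideas without showing them: recursive tag deletion and a dual-thread layout with one buffer maintenance thread and one tag deletion thread. The Python sketch below is only an illustration under stated assumptions, not the paper's implementation. The set of "redundant" tags (`script`, `style`), the regular-expression matching, the buffer size, and all function names (`delete_tags`, `maintain_buffer`, `delete_worker`, `run`) are hypothetical choices made for this example.

```python
import queue
import re
import threading

# Assumption: the abstract does not say which tags are redundant.
REDUNDANT_TAGS = ("script", "style")
SENTINEL = None  # marks the end of the page stream


def delete_tags(html, tags=REDUNDANT_TAGS):
    """Recursively remove <tag>...</tag> blocks until none remain."""
    for tag in tags:
        pattern = re.compile(
            r"<%s\b[^>]*>.*?</%s\s*>" % (tag, tag),
            re.IGNORECASE | re.DOTALL,
        )
        match = pattern.search(html)
        if match:
            # Remove one block, then recurse on the shorter string.
            # Pages with very many blocks could hit Python's recursion limit.
            return delete_tags(html[:match.start()] + html[match.end():], tags)
    return html  # base case: nothing left to delete


def maintain_buffer(pages, buffer):
    """Buffer maintenance thread: keep the bounded buffer supplied with pages."""
    for page in pages:
        buffer.put(page)  # blocks while the buffer is full
    buffer.put(SENTINEL)


def delete_worker(buffer, results):
    """Tag deletion thread: take pages from the buffer and strip their tags."""
    while True:
        page = buffer.get()
        if page is SENTINEL:
            break
        results.append(delete_tags(page))


def run(pages, buffer_size=100):
    """Run both threads to completion and return the cleaned pages in order."""
    buffer = queue.Queue(maxsize=buffer_size)
    results = []
    producer = threading.Thread(target=maintain_buffer, args=(pages, buffer))
    consumer = threading.Thread(target=delete_worker, args=(buffer, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

With a bounded queue, the maintenance thread blocks once the buffer is full, so memory use stays limited while the deletion thread is kept supplied with pages. Whether the paper's buffer works this way is not stated in the abstract.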
Funding: Natural Science Foundation of Hunan Province (2017JJ5064)
Cite this article:
DENG Zi-Yun. Design and Improvement of Tag Deletion Function in Crawler. COMPUTER SYSTEMS APPLICATIONS, 2019, 28(1): 176-181