Received:October 26, 2017 Revised:November 14, 2017
Abstract: The Internet holds massive amounts of data, and structured data often has to be collected from multiple source sites for data analysis and mining, yet writing a custom collection program for each website carries a high labor cost. This paper proposes a theme crawler scheme that structures website data automatically. Taking e-commerce websites as an example, it establishes the theoretical feasibility of structured extraction from two characteristics of such sites: a uniform hierarchical structure, and the industry corpora and conventions available in a vertical domain. It proposes algorithms, including similar-duplicate detection and label matching based on attribute semantics, to analyze page structure and match target fields. For system management and tuning, it also designs preset matching templates and a mechanism for reusing structure-analysis results. Practical application and error-rate tests show that the scheme is highly feasible, greatly reduces hand-written code, and has a low error rate. The design can be applied to theme crawler systems in other domains to obtain large amounts of data from many sites quickly, shifting the focus to structured data processing and information mining.
Keywords: automatic data structuring; crawler; label matching; multiple source sites; e-commerce website
Funding: Young Innovative Talents Project (Natural Science) of the Department of Education of Guangdong Province (2016KQNCX092)
Citation:
ZHANG Qian, LIN An-Cheng, LIAO Xiu-Xiu. Research on Theme Crawler of E-Commerce Website Based on Automatic Data Structuring. COMPUTER SYSTEMS APPLICATIONS, 2018, 27(7): 90-95