Abstract: The Internet holds a huge amount of data, and users often need to acquire structured data from multiple source sites to support data analysis and mining. Writing a customized data acquisition program for each website carries a very high manual cost. This paper presents a scheme for automatic data structuring in a web crawler. Taking an e-commerce website as an example, it establishes the theoretical feasibility of structured extraction based on the site's unified hierarchical structure, vertical domain, and data corpus. The study proposes similar-duplicate detection and attribute-based semantic label matching algorithms, which analyze page structure and match the target fields, and it designs preset matching templates and a mechanism for reusing structural analysis results to support management and tuning of the system. Practical application and error-rate tests show that the scheme is feasible, greatly reduces manual coding, and has a low error rate. The design can be applied to topical crawler systems in other fields to quickly obtain large amounts of data from many sites, letting people focus more on structured data processing and information mining.