Abstract: To meet the needs of science and technology policy research, the China Association for Science and Technology has designed and implemented a policy database system. This study first presents the overall design and workflow of the science and technology policy database, then describes its components in detail. The system consists of three subsystems: data acquisition, data cleaning, and data analysis. The data acquisition subsystem uses the Scrapy framework to build manageable web crawlers for a large number of heterogeneous sites, and uses ABBYY FineReader-based OCR (Optical Character Recognition) to digitize historical documents. The data cleaning subsystem uses machine learning algorithms to perform data deduplication, identification of irrelevant data, and detection of defective data attributes. The data analysis subsystem then performs text classification, association analysis, and full-text search on the valid policy documents. Since its launch in October 2018, the system has collected 564 749 records from 226 data sources. After cleaning, it stores 404 083 records (about 72% of the collected total), providing a substantial data foundation for science and technology policy research.
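As a rough illustration of the deduplication step in the data cleaning subsystem, the sketch below drops records whose text is identical after whitespace and case normalization. It is a minimal stand-in only: the abstract states that the actual system relies on machine learning algorithms, whereas this example swaps in plain exact-match hashing, and the record layout (`source` and `text` keys) and the function names `fingerprint` and `deduplicate` are assumptions made for the example.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a SHA-256 hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct fingerprint, preserving input order.

    `records` is assumed to be an iterable of dicts with a "text" key
    (a hypothetical layout, not the system's actual schema).
    """
    seen = set()
    unique = []
    for record in records:
        key = fingerprint(record["text"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    sample = [
        {"source": "a", "text": "Notice on  Science Policy"},
        {"source": "b", "text": "notice on science policy "},
        {"source": "c", "text": "Another policy"},
    ]
    for record in deduplicate(sample):
        print(record["source"], record["text"])
```

Exact-match hashing only catches copies that differ in spacing or letter case; near-duplicates with small wording changes would need a similarity-based or learned method of the kind the abstract alludes to.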