Computer Systems Applications, 2019, 28(7): 58-64
System Integration and Construction of Science and Technology Policy Database
(1. National Academy of Innovation Strategy, Beijing 100086, China; 2. Beihang University, Beijing 100083, China)
Received: January 3, 2019; Revised: January 24, 2019
Abstract: To support science and technology policy research, the China Association for Science and Technology designed and implemented a science and technology policy database system. This paper first presents the overall design and the system workflow, then describes the system components in detail. The system consists of three subsystems: data acquisition, data cleaning, and data analysis. The data acquisition subsystem uses the Scrapy web crawling framework to provide manageable crawlers for a large number of heterogeneous sites, and uses ABBYY FineReader (document recognition software published by the Russian software company ABBYY) to run optical character recognition (OCR) on historical documents and load them into the database. The data cleaning subsystem uses machine learning algorithms for data deduplication, identification of irrelevant data, and detection of defects in data attributes. The data analysis subsystem performs text classification, association analysis, and full-text search on the policies that pass cleaning and enter the database. Since going online in October 2018, the system has collected 564 749 records from 226 data sources, of which 404 083 were stored after cleaning, providing strong support for science and technology policy research.
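As a rough illustration of the deduplication step in the data cleaning subsystem, the sketch below drops records whose normalized title and body are identical. This is not the authors' method: the abstract states that cleaning relies on machine learning algorithms, whereas this example substitutes plain hash-based exact matching. The field names `title` and `body` and the function names are assumptions made for the example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Unify character width and case, then drop whitespace and punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\W_]+", "", text)


def fingerprint(record: dict) -> str:
    """Hash the normalized title and body (hypothetical field names)."""
    key = normalize(record.get("title", "")) + "|" + normalize(record.get("body", ""))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique
```

Exact matching of this kind only catches copies that differ in case, spacing, punctuation, or character width; reposted policies with edited text would need a similarity-based or learned approach such as the one the abstract refers to.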
Citation:
WU Hong, YANG Bao-Long, DU Zhi-Gao, LI Han-Lu. System Integration and Construction of Science and Technology Policy Database. Computer Systems Applications, 2019, 28(7): 58-64