Computer Systems & Applications, 2016, 25(11): 84-89
Interactive Data Preprocessing System Based on Spark
(1. University of Chinese Academy of Sciences, Beijing 100049, China; 2. Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China)
Received:March 09, 2016    Revised:April 08, 2016
Abstract: High-quality decision-making depends on high-quality data, so data preprocessing is an essential phase of data mining. Traditional data preprocessing systems do not adapt well to big data environments. To handle large-scale data, enterprises currently rely mainly on Hadoop/Hive, but this approach commonly suffers from long running times, low efficiency, and a lack of interaction. To fill this gap, this paper proposes and implements an interactive data preprocessing system based on Spark. The system provides a set of common preprocessing operations as basic components and supports user-defined extensions. It presents data to users in the form of spreadsheets, and it automatically records users' operations to support undo and redo. We describe the system architecture from four aspects: the data model, the data preprocessing operations, the interactive execution engine, and the interactive front end. Finally, we validate the system with real medical data on stroke; the results show that the system can meet interactive processing demands in most big data scenarios.
Keywords: data preprocessing; Spark; interactive; big data
Funding: National Natural Science Foundation of China (U1435220)
Cite this article:
ZHANG Lei, ZHU Feng, ZHONG Hua. Interactive Data Preprocessing System Based on Spark. Computer Systems & Applications, 2016, 25(11): 84-89