Interactive Data Preprocessing System Based on Spark

doi:10.15888/j.cnki.csa.005453

Fund project: National Natural Science Foundation of China (U1435220)

Authors:

ZHANG Lei (张磊)
University of Chinese Academy of Sciences, Beijing 100049, China; Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

ZHU Feng (朱锋)
Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

ZHONG Hua (钟华)
Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Abstract:

High-quality decision-making depends on high-quality data, so data preprocessing is an essential phase of data analytics applications. In the big data era, traditional data preprocessing systems cannot be applied directly. To handle large-scale data, enterprises currently adopt Hadoop/Hive as a popular solution, but it suffers from long running times, low efficiency, and a lack of interaction. To fill this gap, this paper proposes and implements an interactive data preprocessing system based on Spark. The system provides a set of common preprocessing operations as basic components and supports user-defined extensions. It presents data to users in the form of spreadsheets and automatically records users' operations to support undo and redo. We describe the architecture of the system from four aspects: data model, data preprocessing operations, interactive execution engine, and interactive GUI. Finally, we conduct experiments with real medical stroke data; the results show that the system can meet interactive demands in most big data scenarios.

Key words: data preprocessing; Spark; interactive; big data
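
The abstract says the system records each user operation and supports undo and redo. One common way to get that behavior is to keep the source data untouched and store an ordered log of transformations, replaying only the operations currently in effect. The sketch below illustrates that idea on a small in-memory table. It is not the paper's implementation: it does not use Spark, and the names `Session`, `Operation`, `drop_missing`, and `fill_missing` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]


@dataclass(frozen=True)
class Operation:
    """A named, side-effect-free transformation from one table to another."""
    name: str
    apply: Callable[[Table], Table]


@dataclass
class Session:
    """Holds the untouched source table plus an ordered operation log (hypothetical design)."""
    source: Table
    log: List[Operation] = field(default_factory=list)
    cursor: int = 0  # number of logged operations currently in effect

    def run(self, op: Operation) -> None:
        # A new operation discards anything that was undone and not redone.
        del self.log[self.cursor:]
        self.log.append(op)
        self.cursor += 1

    def undo(self) -> bool:
        """Step back one operation; returns False if nothing is left to undo."""
        if self.cursor == 0:
            return False
        self.cursor -= 1
        return True

    def redo(self) -> bool:
        """Re-apply the most recently undone operation, if there is one."""
        if self.cursor == len(self.log):
            return False
        self.cursor += 1
        return True

    def view(self) -> Table:
        # Replay the operations in effect over the source, which is never mutated.
        table = self.source
        for op in self.log[: self.cursor]:
            table = op.apply(table)
        return table


def drop_missing(column: str) -> Operation:
    """Illustrative component: remove rows whose value in `column` is None."""
    return Operation(
        f"drop_missing({column})",
        lambda t: [r for r in t if r.get(column) is not None],
    )


def fill_missing(column: str, value: object) -> Operation:
    """Illustrative component: replace None in `column` with `value`."""
    return Operation(
        f"fill_missing({column})",
        lambda t: [{**r, column: value} if r.get(column) is None else r for r in t],
    )
```

A caller would create `Session(rows)`, call `run(fill_missing("age", 0))`, inspect `view()`, and step back with `undo()`. Because every operation returns a new table rather than modifying its input, undo only has to move the cursor, and user-defined components fit in as additional `Operation` values.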

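Cite this article

ZHANG Lei, ZHU Feng, ZHONG Hua. Interactive data preprocessing system based on Spark. Computer Systems & Applications (计算机系统应用), 2016, 25(11): 84-89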

History

Received: 2016-03-09
Last revised: 2016-04-08
Published online: 2016-11-15
