System Information Management Framework of Distributed System for Task Scheduling Optimization

DOI: 10.15888/j.cnki.csa.007166

2019, Vol. 28, No. 11, pp. 54-62

Author:

HU Ya-Hui
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
ZHU Zong-Wei
School of Software Engineering, University of Science and Technology of China, Suzhou 215123, China
LIU Huang-He
School of Software Engineering, University of Science and Technology of China, Suzhou 215123, China
WANG Chao
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China


Abstract:

Deep learning has attracted sustained attention from both academia and industry in recent years and has achieved remarkable results in fields such as computer vision and speech recognition. Deep learning comprises two stages, training and inference, and practical applications are mainly concerned with inference. Inference involves a huge amount of computation, so using distributed systems to speed it up has drawn increasing attention. However, building a distributed deep learning inference system faces challenges such as the rapid iteration of deep learning accelerators and the complexity and diversity of upper-layer applications and computing tasks. This paper designs and implements a system information management framework that collects and processes the various kinds of information in the system. The collection and processing rules are highly extensible and flexible, and the framework provides a general RESTful API for data access, so that a distributed deep learning inference system can flexibly accommodate various hardware accelerators and dynamically adjust its task scheduling strategy. Finally, the framework's functionality is verified through an application example and the experimental results are analyzed.

Key words: distributed system; deep learning inference; task scheduling; system information management
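
The abstract names three ingredients: pluggable collection, pluggable processing rules, and a RESTful read interface. The sketch below is one possible way to picture how they could fit together; it is not the authors' implementation. The class `InfoManager`, the `/info/<name>` route, the collector name `fpga0`, and the sample readings are all hypothetical and chosen only for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Any, Callable, Dict, Tuple

Collector = Callable[[], Dict[str, Any]]
Rule = Callable[[Dict[str, Any]], Dict[str, Any]]


class InfoManager:
    """Hypothetical registry of collectors and processing rules."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}
        self._rules: Dict[str, Rule] = {}

    def register_collector(self, name: str, collector: Collector) -> None:
        # New hardware is supported by registering a new collector.
        self._collectors[name] = collector

    def register_rule(self, name: str, rule: Rule) -> None:
        # A rule post-processes the raw readings of one collector.
        self._rules[name] = rule

    def snapshot(self, name: str) -> Dict[str, Any]:
        raw = self._collectors[name]()
        rule = self._rules.get(name)
        return rule(raw) if rule else raw

    def handle_get(self, path: str) -> Tuple[int, str]:
        """Map GET /info/<name> to (status code, JSON body)."""
        parts = path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "info" and parts[1] in self._collectors:
            return 200, json.dumps(self.snapshot(parts[1]), sort_keys=True)
        return 404, json.dumps({"error": "unknown resource"})


def make_handler(manager: InfoManager):
    """Build a standard-library HTTP handler bound to one manager."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, body = manager.handle_get(self.path)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


manager = InfoManager()
# Made-up accelerator readings; a real collector would query the device.
manager.register_collector("fpga0", lambda: {"busy_ms": 250, "window_ms": 1000})
# Rule: derive a utilization figure that a scheduler could rank nodes by.
manager.register_rule(
    "fpga0", lambda raw: {"utilization": raw["busy_ms"] / raw["window_ms"]}
)
```

With these registrations, `manager.handle_get("/info/fpga0")` returns `(200, '{"utilization": 0.25}')`, and an unregistered name yields a 404. To expose it over HTTP, the handler can be passed to the standard library's `http.server.HTTPServer`, for example `HTTPServer(("127.0.0.1", 8080), make_handler(manager)).serve_forever()`. Keeping collectors and rules as plain callables is one simple way to let a scheduler's data source change without touching the access interface.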


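Cite this article

HU Ya-Hui, ZHU Zong-Wei, LIU Huang-He, WANG Chao. System Information Management Framework of Distributed System for Task Scheduling Optimization. Computer Systems & Applications (计算机系统应用), 2019, 28(11): 54-62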


History

Received: 2019-04-26
Last revised: 2019-05-23
Published online: 2019-11-08
Publication date: 2019-11-15
