Received: November 3, 2020; Revised: December 2, 2020
Abstract: Fully managed deep learning platforms offer web-based solutions for deep learning experiments and accelerate the adoption of deep learning in production and everyday life. To support the training of image recognition models on such a web-based platform, this study designs and implements a distributed task execution system for deep learning experiments. The system consists of four modules: resource monitoring, task scheduling, task execution, and log management. It schedules tasks according to policies such as resource utilization, executes them in Docker containers, and collects the generated logs in real time. Tests show that the system meets its functional requirements and reaches the expected targets for reliability and stability, and that integrating it into the platform reduces training time by about 20%.
Citation:
GAO Guo-Liang, CHEN Lei-Fang, LIU Yi-Ming. Distributed Task Execution System for Deep Learning. Computer Systems Applications, 2021, 30(7): 80-86