Abstract: A full-lifecycle hosting platform for deep learning offers a web-based solution for running experimental tasks and promotes the application of deep learning technology in industry and everyday life. To address the problem of training image recognition models on such a platform, this study designs and implements a distributed task execution system for experimental tasks. The system consists of four modules: resource monitoring, task scheduling, task execution, and log management. It schedules tasks according to indicators such as resource utilization, executes them in Docker containers, and collects the generated log data in real time. Test results show that the system meets its functional requirements and achieves the desired targets for reliability and stability, and that it reduces training time by about 20% after being integrated into the deep learning platform.
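As an illustration of scheduling by resource utilization, the following is a minimal sketch and not the system's actual algorithm. The names `NodeStatus`, `load_score`, and `pick_node`, the weights, and the `max_load` threshold are all assumptions introduced here; the abstract states only that tasks are scheduled according to indicators such as resource utilization.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeStatus:
    """Hypothetical snapshot reported by a resource monitor (values in 0..1)."""
    name: str
    cpu_util: float
    gpu_util: float
    mem_util: float


def load_score(node: NodeStatus,
               weights: Tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Weighted utilization; the weights are illustrative assumptions."""
    w_cpu, w_gpu, w_mem = weights
    return w_cpu * node.cpu_util + w_gpu * node.gpu_util + w_mem * node.mem_util


def pick_node(nodes: List[NodeStatus], max_load: float = 0.9) -> Optional[NodeStatus]:
    """Return the least-loaded node below max_load, or None if all are saturated."""
    candidates = [n for n in nodes if load_score(n) < max_load]
    if not candidates:
        return None  # caller would keep the task queued and retry later
    return min(candidates, key=load_score)


if __name__ == "__main__":
    cluster = [
        NodeStatus("worker-1", cpu_util=0.95, gpu_util=0.95, mem_util=0.95),
        NodeStatus("worker-2", cpu_util=0.10, gpu_util=0.20, mem_util=0.30),
    ]
    chosen = pick_node(cluster)
    print(chosen.name if chosen else "no node available")
```

In a design like this, the chosen node would then start the training task in a Docker container; returning `None` when every node is saturated lets the scheduler keep the task queued instead of overloading a worker.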