Large Message Broadcast Design for Distributed Machine Learning
Abstract:

    Traditionally, Message Passing Interface (MPI) runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively small number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with the new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this study, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) a Pipelined Chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a Topology-Aware PC (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of the proposed designs, we present a performance evaluation on three GPU clusters with diverse characteristics: RX1, a dense multi-GPU system; RX2, with a single K80 GPU card per node; and RX3, with a single P100 GPU per node. The proposed designs offer up to 14× and 16.6× better performance than MPI+NCCL1 based solutions for intra- and inter-node broadcast latency, respectively. We have also enhanced the performance results by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design: we report up to 10× better performance for small and medium message sizes and comparable performance for large message sizes. We also observe that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. The results clearly highlight the strength of the proposed solution in terms of both portability and performance.
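
    To make the idea behind a pipelined chain broadcast concrete, the following is a minimal, self-contained sketch: the root splits a large message into chunks and every rank forwards each chunk to its successor along a logical chain, so downstream ranks start receiving early chunks while upstream ranks are still transmitting later ones. This is not the paper's implementation; the chunk size, chain ordering, and use of host memory are illustrative assumptions (the proposed PC/TA-PC designs operate on GPU buffers through a CUDA-aware MPI runtime and additionally account for the PCIe topology of multi-GPU nodes).

    /* Hedged sketch of a pipelined chain (chunked) broadcast over MPI.
     * Not the paper's PC/TA-PC implementation; sizes and ordering are
     * illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void chain_bcast(char *buf, size_t total, size_t chunk,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Logical position in the chain: root first, then the remaining ranks. */
        int pos  = (rank - root + size) % size;
        int prev = (rank - 1 + size) % size;
        int next = (rank + 1) % size;

        for (size_t off = 0; off < total; off += chunk) {
            int n = (int)((total - off < chunk) ? (total - off) : chunk);
            if (pos != 0)           /* every rank but the root receives from its predecessor */
                MPI_Recv(buf + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
            if (pos != size - 1)    /* every rank but the last forwards to its successor */
                MPI_Send(buf + off, n, MPI_BYTE, next, 0, comm);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t total = 32UL << 20;             /* 32 MB "large message" (illustrative) */
        char *buf = malloc(total);
        if (rank == 0) memset(buf, 1, total);  /* root holds the data to broadcast */

        chain_bcast(buf, total, 1UL << 20, 0, MPI_COMM_WORLD);  /* 1 MB pipeline chunks */

        free(buf);
        MPI_Finalize();
        return 0;
    }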

Get Citation

Xin YJ, Xie B, Li ZX. Large message broadcast design for distributed machine learning. Computer Systems & Applications, 2020, 29(1): 1-13. (in Chinese)

History
  • Received: June 17, 2019
  • Revised: July 12, 2019
  • Online: December 30, 2019
  • Published: January 15, 2020