Large Message Broadcast Design for Distributed Machine Learning
Abstract:

    Traditionally, Message Passing Interface (MPI) runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively small number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with the new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this study, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) a Pipelined Chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a Topology-Aware PC (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of the proposed designs, we present a performance evaluation on three GPU clusters with diverse characteristics: RX1, a dense multi-GPU system; RX2, with a single K80 GPU card per node; and RX3, with a single P100 GPU per node. The proposed designs offer up to 14× and 16.6× better performance than MPI+NCCL1 based solutions for intra- and inter-node broadcast latency, respectively. We have also enhanced the performance results by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design: we report up to 10× better performance for small and medium message sizes and comparable performance for large message sizes. We also observe that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. The results clearly highlight the strength of the proposed solution in terms of both portability and performance.
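
    To make the idea behind a pipelined chain broadcast concrete, the following is a minimal, self-contained sketch: the root splits a large message into chunks and every rank forwards each chunk to its successor along a logical chain, so downstream ranks start receiving early chunks while upstream ranks are still transmitting later ones. This is not the paper's implementation; the chunk size, chain ordering, and use of host memory are illustrative assumptions (the proposed PC/TA-PC designs operate on GPU buffers through a CUDA-aware MPI runtime and additionally account for the PCIe topology of multi-GPU nodes).

    /* Hedged sketch of a pipelined chain (chunked) broadcast over MPI.
     * Not the paper's PC/TA-PC implementation; sizes and ordering are
     * illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    static void chain_bcast(char *buf, size_t total, size_t chunk,
                            int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Logical position in the chain: root first, then the remaining ranks. */
        int pos  = (rank - root + size) % size;
        int prev = (rank - 1 + size) % size;
        int next = (rank + 1) % size;

        for (size_t off = 0; off < total; off += chunk) {
            int n = (int)((total - off < chunk) ? (total - off) : chunk);
            if (pos != 0)           /* every rank but the root receives from its predecessor */
                MPI_Recv(buf + off, n, MPI_BYTE, prev, 0, comm, MPI_STATUS_IGNORE);
            if (pos != size - 1)    /* every rank but the last forwards to its successor */
                MPI_Send(buf + off, n, MPI_BYTE, next, 0, comm);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t total = 32UL << 20;             /* 32 MB "large message" (illustrative) */
        char *buf = malloc(total);
        if (rank == 0) memset(buf, 1, total);  /* root holds the data to broadcast */

        chain_bcast(buf, total, 1UL << 20, 0, MPI_COMM_WORLD);  /* 1 MB pipeline chunks */

        free(buf);
        MPI_Finalize();
        return 0;
    }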

Get Citation

Xin YJ, Xie B, Li ZX. Large message broadcast design for distributed machine learning. Computer Systems & Applications, 2020, 29(1): 1-13. (in Chinese)

History
  • Received: June 17, 2019
  • Revised: July 12, 2019
  • Online: December 30, 2019
  • Published: January 15, 2020