Source author record

Gaoxiong Zeng

Gaoxiong Zeng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning Networking and Internet Architecture

Catalog footprint

What is connected

2works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Congestion Control Mechanisms for Inter-Datacenter Networks

Applications running in geographically distributed setting are becoming prevalent. Large-scale online services often share or replicate their data into multiple data centers (DCs) in different geographic regions. Driven by the data communication need of these applications, inter-datacenter network (IDN) is getting increasingly important. However, we find congestion control for inter-datacenter networks quite challenging. Firstly, the inter-datacenter communication involves both data center networks (DCNs) and wide-area networks (WANs) connecting multiple data centers. Such a network environment presents quite heterogeneous characteristics (e.g., buffer depths, RTTs). Existing congestion control mechanisms consider either DCN or WAN congestion, while not simultaneously capturing the degree of congestion for both. Secondly, to reduce evolution cost and improve flexibility, large enterprises have been building and deploying their wide-area routers based on shallow-buffered switching chips. However, with legacy congestion control mechanisms (e.g., TCP Cubic), shallow buffer can easily get overwhelmed by large BDP (bandwidth-delay product) wide-area traffic, leading to high packet losses and degraded throughput. This thesis describes my research efforts on optimizing congestion control mechanisms for the inter-datacenter networks. First, we design GEMINI - a practical congestion control mechanism that simultaneously handles congestions both in DCN andWAN. Second, we present FlashPass - a proactive congestion control mechanism that achieves near zero loss without degrading throughput under the shallow-buffered WAN. Extensive evaluation shows their superior performance over existing congestion control mechanisms.

preprint2020arXiv

Domain-specific Communication Optimization for Distributed DNN Training

Communication overhead poses an important obstacle to distributed DNN training and draws increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping and layer-wise flow scheduling, etc., are still coarse-grained and insufficient for an efficient distributed training especially when the network is under pressure. We present DLCP, a novel solution exploiting the domain-specific properties of deep learning to optimize communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises of several key innovations beyond prior work: e.g., it exploits {\em bounded loss tolerance} of SGD-based training to improve tail communication latency which cannot be avoided purely through gradient compression. It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on layers and magnitudes of gradients to further speedup model convergence without affecting accuracy. In addition, it leverages inter-packet order-independency to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet and PyTorch, and deployed it in our small-scale testbed with 10 Nvidia V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to $84.3\%$ additional training acceleration over the best existing solutions.