BandPilot: Towards Performance- and Contention-Aware GPU Dispatching in AI Clusters
Modern multi-tenant AI clusters are increasingly communication-bound, driven by high-volume and multi-round GPU-to-GPU collective communication. Consequently, the GPU dispatcher's choice of a physical GPU subset for each tenant largely determines the job's effective collective bandwidth and thus its performance ceiling. Existing dispatchers predominantly rely on static, topology-aware heuristics that prioritize GPU resource compactness, assuming that minimizing physical distance maximizes communication bandwidth. However, we reveal that this assumption often fails due to complex system-level bottlenecks, such as non-linear NIC saturation and inter-node link heterogeneity.This paper presents BandPilot, a performance- and contention-aware GPU dispatching primitive that optimizes effective collective bandwidth for multi-tenant AI clusters. Specifically, BandPilot learns a data-efficient bandwidth model from sparse NCCL measurements via a hierarchical design. Guided by the model, a fast hybrid search combines an equilibrium-driven constructor with a pruned elimination search to navigate the combinatorial allocation space in real time. To account for multi-tenant interference, BandPilot virtually merges a candidate allocation with co-located cross-host jobs to conservatively estimate shared bottleneck capacity and predict contention-degraded bandwidth. Across a 32-GPU H100 cluster and heterogeneous simulations, BandPilot achieves 92-97% bandwidth efficiency relative to the best-found reference, improving average efficiency by 20-40% over topology-compactness heuristics.