Researcher profile

Pavan Balaji

Pavan Balaji contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 15 - Unverified · Verification L1 · Unclaimed author
3 works
0 followers
3 topics
4 close collaborators

Published work

3 published items

Preprint · 2026 · arXiv

Collective Communication for 100k+ GPUs

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed at Meta, engineered to optimize performance across the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference. The framework is designed to support complex workloads on clusters exceeding 100,000 GPUs, ensuring reliable, high-throughput, and low-latency data exchange. Empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency. This research contributes a robust solution for enabling the next generation of LLMs to operate at unprecedented scales.

Preprint · 2020 · arXiv

How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI

MPI+threads is gaining prominence as an alternative to the traditional MPI everywhere model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100x slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. Typically, MPI users do not expose logical communication parallelism. Consequently, MPI libraries use conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to parallel network resources and hurting performance. To enhance MPI+threads' communication performance, researchers have proposed MPI Endpoints as a user-visible extension to MPI-3.1. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could allow each thread to have a dedicated communication path to the network and improve performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. We play the role of devil's advocate and question the need for user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard and thus unburden the user? More important, what functionality would we lose through such abstraction? This paper answers these questions through a new MPI-3.1 implementation that uses virtual communication interfaces (VCIs). VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving domain scientists from endpoints. We identify cases where VCIs perform as well as user-visible endpoints, as well as cases where such abstraction hurts performance.

Preprint · 2020 · arXiv

Scalable Communication Endpoints for MPI+Threads Applications

Hybrid MPI+threads programming is gaining prominence as an alternative to the traditional "MPI everywhere" model to better handle the disproportionate increase in the number of cores compared with other on-node resources. Current implementations of these two models represent the two extreme cases of communication resource sharing in modern MPI implementations. In the MPI-everywhere model, each MPI process has a dedicated set of communication resources (also known as endpoints), which is ideal for performance but is resource wasteful. With MPI+threads, current MPI implementations share a single communication endpoint for all threads, which is ideal for resource usage but is hurtful for performance. In this paper, we explore the tradeoff space between performance and communication resource usage in MPI+threads environments. We first demonstrate the two extreme cases, one where all threads share a single communication endpoint and another where each thread gets its own dedicated communication endpoint (similar to the MPI-everywhere model), and showcase the inefficiencies in both these cases. Next, we perform a thorough analysis of the different levels of resource sharing in the context of Mellanox InfiniBand. Using the lessons learned from this analysis, we design an improved resource-sharing model to produce scalable communication endpoints that can achieve the same performance as with dedicated communication resources per thread but using just a third of the resources.