Source author record

Yuval Shpigelman

Yuval Shpigelman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Artificial Intelligence; Rings and Algebras (math.RA); Networking and Internet Architecture; Distributed, Parallel, and Cluster Computing; Machine Learning

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint, 2026, arXiv

Resilient AI Supercomputer Networking using MRC and SRv6

Tail latency dominates the performance of synchronous pretraining jobs when running at very large scales. We describe a three-pronged approach: (1) a new RDMA-based transport protocol, MRC, that sprays across many paths and actively load-balances between them, eliminating the issue of flow collisions; (2) the use of multi-plane Clos topologies to get the benefits of high switch radix and redundancy, allowing training clusters of well over 100K GPUs to be built as two-tier topologies while increasing physical redundancy; and (3) the use of static source routing with SRv6 to give MRC the freedom to bypass failures by itself. We describe our experiences running MRC and static SRv6 routing in production in OpenAI's and Microsoft's largest training clusters, where it has been used to train the latest frontier models. We demonstrate how MRC allows AI training jobs to ride out many network failures that previously would have interrupted training.

preprint, 2022, arXiv

Reinforcement Learning for Datacenter Congestion Control

We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. To date, no such learning-based algorithms have shown practical potential in this domain. Instead, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly seen scenarios. In contrast, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial observability, non-stationarity, and multiple objectives. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates the behavior of communication networks, show improved performance on all of the considered metrics simultaneously, compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.

preprint, 2015, arXiv

On the Codimension Sequence of G-Simple Algebras

In the 1980s, Regev, using results of Formanek, Procesi and Razmyslov in invariant theory and Hilbert series, determined asymptotically the codimension sequence of $m\times m$ matrices over an algebraically closed field of characteristic zero. Inspired by Regev's ideas, we found that the asymptotics of $c_{n}^{G}(A)$, the $G$-graded codimension sequence of a finite-dimensional $G$-simple algebra $A$, is equal to $αn^{\frac{1-\dim(A_{e})}{2}}(\dim(A))^{n}$ (this was conjectured by E. Aljadeff, D. Haile and M. Natapov), where $α$ is a not yet determined number. Moreover, in the case where $A$ is the algebra of $m\times m$ matrices with an arbitrary elementary $G$-grading, we also managed to calculate $α$.

preprint, 2015, arXiv

The Asymptotic Behavior of the Codimension Sequence of Affine G-Graded Algebras

Let W be an affine PI algebra over a field of characteristic zero graded by a finite group G. We show that there exist $α_{1},α_{2}\in\mathbb{R}, β\in\frac{1}{2}\mathbb{Z}$, and $l\in\mathbb{N}$ such that $α_{1}n^βl^{n}\leq c_{n}^{G}(W)\leqα_{2}n^βl^{n}$. Furthermore, if W has a unit then the asymptotic behavior of $c_{n}^{G}(W)$ is $αn^βl^{n}$ where $α\in\mathbb{R}, β\in\frac{1}{2}\mathbb{Z}, l\in\mathbb{N}$.