Source author record

Srinivas Sridharan

Srinivas Sridharan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


math.OC · Distributed, Parallel, and Cluster Computing · quant-ph · Machine Learning · Systems and Control · Hardware Architecture · math.NA · Networking and Internet Architecture · Performance

Catalog footprint

16 works · 9 topics · 4 close collaborators


preprint · 2026 · arXiv

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observation, reproduction and optimization of distributed machine learning (ML) workload behavior in production AI systems and enables efficient software-hardware (SW-HW) co-design for future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra execution trace (ET). These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra ETs collected on production AI clusters and demonstrate value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including but not limited to NVIDIA, AMD, Meta, Keysight, HPE, and Scala, to name a few.
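As a rough illustration of what a graph-based trace with typed operations, durations, and dependencies makes possible, the Python sketch below models a tiny trace and computes its critical path. Every name in it (`TraceNode`, `critical_path_us`, the field names, the example durations) is hypothetical and chosen for this illustration only; it is not the Chakra ET schema or any Chakra tool.

```python
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    """Hypothetical trace node; NOT the actual Chakra ET schema."""
    node_id: int
    kind: str                 # "compute", "memory", or "communication"
    duration_us: int          # made-up duration in microseconds
    deps: list = field(default_factory=list)  # ids that must finish first


def critical_path_us(nodes):
    """Longest dependency chain in an acyclic trace.

    This is a lower bound on runtime if resources were unlimited.
    """
    by_id = {n.node_id: n for n in nodes}
    finish = {}

    def finish_time(node_id):
        # Memoized earliest finish time: latest dependency finish + own duration.
        if node_id not in finish:
            node = by_id[node_id]
            start = max((finish_time(d) for d in node.deps), default=0)
            finish[node_id] = start + node.duration_us
        return finish[node_id]

    return max((finish_time(n.node_id) for n in nodes), default=0)


# Illustrative trace: a load, a compute step, then communication that
# overlaps with further compute before a final dependent step.
EXAMPLE = [
    TraceNode(0, "memory", 5),
    TraceNode(1, "compute", 40, deps=[0]),
    TraceNode(2, "communication", 30, deps=[1]),
    TraceNode(3, "compute", 25, deps=[1]),
    TraceNode(4, "compute", 10, deps=[2, 3]),
]

if __name__ == "__main__":
    print(critical_path_us(EXAMPLE))
```

In this toy trace the communication node, not the overlapping compute node, sits on the critical path, which is the kind of question a dependency-aware trace lets a simulator or replay tool answer.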

preprint · 2022 · arXiv

Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively.

preprint · 2022 · arXiv

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.

preprint · 2020 · arXiv

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

preprint · 2016 · arXiv

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

We design and implement a distributed multinode synchronous SGD algorithm, without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design points for different networks. We demonstrate scaling of CNNs on 100s of nodes, and present what we believe to be record training throughputs. A 512 minibatch VGG-A CNN training run is scaled 90X on 128 nodes. Also, 256 minibatch VGG-A and OverFeat-FAST networks are scaled 53X and 42X respectively on a 64 node cluster. We also demonstrate the generality of our approach via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes. Thereafter we attempt to democratize deep learning by training on an Ethernet based AWS cluster and show ~14X scaling on 16 nodes.
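The scaling figures quoted in this abstract can be read as parallel efficiency, that is, speedup divided by node count. The Python sketch below performs only that arithmetic on the numbers stated above; the helper names `parallel_efficiency` and `efficiency_table` and the row labels are illustrative and do not come from the paper.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup / nodes, the fraction of ideal linear scaling achieved."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# (label, reported speedup, node count) as quoted in the abstract above.
REPORTED = [
    ("VGG-A, minibatch 512", 90.0, 128),
    ("VGG-A, minibatch 256", 53.0, 64),
    ("OverFeat-FAST", 42.0, 64),
    ("7-layer DNN", 6.5, 16),
    ("AWS Ethernet cluster (approximate)", 14.0, 16),
]


def efficiency_table(rows=REPORTED):
    """Return (label, efficiency in percent rounded to one decimal) pairs."""
    return [
        (label, round(100 * parallel_efficiency(speedup, nodes), 1))
        for label, speedup, nodes in rows
    ]


if __name__ == "__main__":
    for label, pct in efficiency_table():
        print(f"{label}: {pct}% of linear scaling")
```

Read this way, the 90X result on 128 nodes corresponds to roughly 70% of linear scaling, and the approximate 14X result on 16 nodes to roughly 88%.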

preprint · 2014 · arXiv

Bundle-based pruning in the max-plus curse of dimensionality free method

Recently a new class of techniques termed the max-plus curse of dimensionality-free methods has been developed to solve nonlinear optimal control problems. In these methods the discretization in state space is avoided by using a max-plus basis expansion of the value function. This requires storing only the coefficients of the basis functions used for representation. However, the number of basis functions grows exponentially with respect to the number of time steps of propagation to the time horizon of the control problem. This so-called "curse of complexity" can be managed by applying a pruning procedure which selects the subset of basis functions that contribute most to the approximation of the value function. The pruning procedures described thus far in the literature rely on the solution of a sequence of high dimensional optimization problems which can become computationally expensive. In this paper we show that if the max-plus basis functions are linear and the region of interest in state space is convex, the pruning problem can be efficiently solved by the bundle method. This approach combining the bundle method and semidefinite formulations is applied to the quantum gate synthesis problem, in which the state space is the special unitary group (which is non-convex). This is based on the observation that the convexification of the unitary group leads to an exact relaxation. The results are studied and validated via examples.

preprint · 2013 · arXiv

Efficient Desynchronization of Thermostatically Controlled Loads

This paper considers demand side management in smart power grid systems containing significant numbers of thermostatically controlled loads such as air conditioning systems, heat pumps, etc. Recent studies have shown that the overall power consumption of such systems can be regulated up and down centrally by broadcasting small setpoint change commands without significantly impacting consumer comfort. However, sudden simultaneous setpoint changes induce undesirable power consumption oscillations due to sudden synchronization of the on/off cycles of the individual units. In this paper, we present a novel algorithm for counter-acting these unwanted oscillations, which requires neither central management of the individual units nor communication between units. We present a formal proof of convergence of homogeneous populations to desynchronized status, as well as simulations that indicate that the algorithm is able to effectively dampen power consumption oscillations for both homogeneous and heterogeneous populations of thermostatically controlled loads.

preprint · 2012 · arXiv

Deterministic filtering and dimensionality reduction for optimal attitude estimation on SO(3)

In this article we introduce the use of recently developed min/max-plus techniques in order to solve the optimal attitude estimation problem in filtering for nonlinear systems on the special orthogonal (SO(3)) group. This work helps obtain computationally efficient methods for the synthesis of deterministic filters for nonlinear systems -- i.e. optimal filters which estimate the state using a related optimal control problem. The technique indicated herein is validated using a set of optimal attitude estimation example problems on SO(3).

preprint · 2012 · arXiv

Deterministic filtering and max-plus methods for robust state estimation in multi-sensor settings

A robust (deterministic) filtering approach to the problem of optimal sensor selection is considered herein. For a given system with several sensors, at each time step the output of one of the sensors must be chosen in order to obtain the best state estimate. We reformulate this problem in an optimal control framework which can then be solved using dynamic programming. In order to tackle the numerical computation of the solution in an efficient manner, we exploit the preservation of the min-plus structure of the optimal cost function when acted upon by the dynamic programming operator. This technique yields a grid free numerical approach to the problem. Simulations on an example problem serve to highlight the efficacy of this generalizable approach to robust multi-sensor state estimation.

preprint · 2012 · arXiv

Min-Plus approaches and Cluster Based Pruning for Filtering in Nonlinear Systems

The design of deterministic filters can be cast as a problem of minimizing an associated cost function for an optimal control problem. Employing the min-plus linearity property of the dynamic programming operator (associated with the control problem) results in a computationally feasible approach (while avoiding linearization of the system dynamics/output). This article describes the salient features of this approach and a specific form of pruning/projection, based on clustering, which serves to facilitate the numerical efficiency of these methods.

preprint · 2012 · arXiv

Min-Plus Techniques for Set-Valued State Estimation

This article approaches deterministic filtering via an application of the min-plus linearity of the corresponding dynamic programming operator. This filter design method yields a set-valued state estimator for discrete-time nonlinear systems (nonlinear dynamics and output functions). The energy bounds in the process and the measurement disturbances are modeled using a sum quadratic constraint. The filtering problem is recast into an optimal control problem in the form of a Hamilton-Jacobi-Bellman (HJB) equation, the solution to which is obtained by employing the min-plus linearity property of the dynamic programming operator. This approach enables the solution to the HJB equation and the design of the filter without recourse to linearization of the system dynamics/output equation.
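The min-plus linearity property mentioned here can be checked numerically on a toy problem. The Python sketch below is a generic finite-state illustration, not the filter construction from the paper: the operator `dp_operator`, the random integer costs, and the transition table are all assumptions made for the example. In the min-plus algebra, "addition" is the pointwise minimum and "scalar multiplication" is adding a constant, so linearity means the operator commutes with both.

```python
import random


def dp_operator(value, cost, step):
    """One dynamic-programming step on a finite state space.

    (S V)(x) = min over controls u of [ cost[x][u] + V(step[x][u]) ]
    """
    return [
        min(c + value[nxt] for c, nxt in zip(cost[x], step[x]))
        for x in range(len(value))
    ]


def pointwise_min(a, b):
    """Min-plus 'sum' of two value functions."""
    return [min(p, q) for p, q in zip(a, b)]


def make_problem(n_states=6, n_controls=3, seed=0):
    """Hypothetical toy problem; integer data keeps the arithmetic exact."""
    rng = random.Random(seed)
    cost = [[rng.randint(0, 9) for _ in range(n_controls)] for _ in range(n_states)]
    step = [[rng.randrange(n_states) for _ in range(n_controls)] for _ in range(n_states)]
    v1 = [rng.randint(0, 20) for _ in range(n_states)]
    v2 = [rng.randint(0, 20) for _ in range(n_states)]
    return cost, step, v1, v2


def check_min_plus_linearity(seed=0, shift=7):
    cost, step, v1, v2 = make_problem(seed=seed)
    # S[min(V1, V2)] should equal min(S[V1], S[V2]).
    lhs = dp_operator(pointwise_min(v1, v2), cost, step)
    rhs = pointwise_min(dp_operator(v1, cost, step), dp_operator(v2, cost, step))
    # S[V1 + k] should equal S[V1] + k for a constant k.
    shifted = dp_operator([v + shift for v in v1], cost, step)
    scaled = [v + shift for v in dp_operator(v1, cost, step)]
    return lhs == rhs and shifted == scaled


if __name__ == "__main__":
    print(all(check_min_plus_linearity(seed=s) for s in range(20)))
```

Because the operator distributes over the pointwise minimum, a value function stored as a minimum of basis functions can be propagated one basis function at a time, which is what makes a grid-free representation workable.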

preprint · 2012 · arXiv

Optimal rotation control for a qubit subject to continuous measurement

In this article we analyze the optimal control strategy for rotating a monitored qubit from an initial pure state to an orthogonal state in minimum time. This strategy is described for two different cost functions of interest which do not have the usual regularity properties. Hence, as classically smooth cost functions may not exist, we interpret these functions as viscosity solutions to the optimal control problem. Specifically we prove their existence and uniqueness in this weak-solution setting. In addition, we also give bounds on the time optimal control to prepare any pure state from a mixed state.

preprint · 2011 · arXiv

Optimal rotation of a qubit under dynamic measurement and velocity control

In this article we explore a modification in the problem of controlling the rotation of a two level quantum system from an initial state to a final state in minimum time. Specifically we consider the case where the qubit is being weakly monitored, under the assumption that both the measurement strength and the angular velocity are control signals. This modification alters the dynamics significantly and enables the exploitation of the measurement backaction to assist in achieving the control objective. The proposed method yields a significant speedup in achieving the desired state transfer compared to previous approaches. These results are demonstrated via numerical solutions for an example problem on a single qubit.

preprint · 2010 · arXiv

A reduced complexity numerical method for optimal gate synthesis

Although quantum computers have the potential to efficiently solve certain problems considered difficult by known classical approaches, the design of a quantum circuit remains computationally difficult. It is known that the optimal gate design problem is equivalent to the solution of an associated optimal control problem, the solution to which is also computationally intensive. Hence, in this article, we introduce the application of a class of numerical methods (termed the max-plus curse of dimensionality free techniques) that determine the optimal control, thereby synthesizing the desired unitary gate. The application of this technique to quantum systems has a growth in complexity that depends on the cardinality of the control set approximation rather than the much larger growth with respect to spatial dimensions in approaches based on gridding of the space, used in previous literature. This technique is demonstrated by obtaining an approximate solution for the gate synthesis on $SU(4)$, a problem that is computationally intractable by grid based approaches.

preprint · 2010 · arXiv

Numerical Solution of the Dynamic Programming Equation for the Optimal Control of Quantum Spin Systems

The purpose of this paper is to describe the numerical solution of the Hamilton-Jacobi-Bellman (HJB) equation for an optimal control problem for quantum spin systems. This HJB equation is a first order nonlinear partial differential equation defined on a Lie group. We employ recent extensions of the theory of viscosity solutions from Euclidean space to Riemannian manifolds to interpret possibly non-differentiable solutions to this equation. Results from differential topology on the triangulation of manifolds are then used to develop a finite difference approximation method, which is shown to converge using viscosity solution techniques. An example is provided to illustrate the method.

preprint · 2008 · arXiv

Gate complexity using Dynamic Programming

The relationship between efficient quantum gate synthesis and control theory has been a topic of interest in the quantum control literature. Motivated by this work, we describe in the present article how the dynamic programming technique from optimal control may be used for the optimal synthesis of quantum circuits. We demonstrate simulation results on an example system on SU(2), to obtain plots related to the gate complexity and sample paths for different logic gates.
