Source author record

Yogish Sabharwal

Yogish Sabharwal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms · Distributed, Parallel, and Cluster Computing · Machine Learning · Computation and Language · Computational Complexity · Databases · Discrete Mathematics · math.OC

Catalog footprint

What is connected

12 works

8 topics

4 close collaborators


preprint, 2020, arXiv

Effective Elastic Scaling of Deep Learning Workloads

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of DL jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency when running on multiple nodes. We design a fast dynamic programming based optimizer to solve this problem in real time to determine jobs that can be scaled up/down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs. We demonstrate empirically that our elastic scaling algorithm can complete up to $\approx 2 \times$ as many jobs as a strong baseline algorithm that also scales the number of GPUs but does not change the batch size. We also demonstrate that the average completion time with our algorithm is up to $\approx 10 \times$ faster than that of the baseline.
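The abstract does not spell out the optimizer, so the following is only a minimal sketch of how a dynamic program could split a fixed GPU pool among jobs. It assumes a hypothetical table `throughput[j][g]` giving the measured throughput of job `j` on `g` GPUs (with the batch size already chosen for that GPU count); the function name and the objective (summed throughput) are illustrative assumptions, not the paper's formulation.

```python
def allocate_gpus(throughput, total_gpus):
    """Pick a GPU count per job to maximize summed throughput.

    throughput[j][g] is a hypothetical, pre-measured throughput of job j
    on g GPUs; index 0 means the job is paused.
    Returns (best_total, allocation_per_job).
    """
    n = len(throughput)
    neg_inf = float("-inf")
    # best[j][c]: best total for the first j jobs using at most c GPUs
    best = [[0.0] * (total_gpus + 1)]
    best += [[neg_inf] * (total_gpus + 1) for _ in range(n)]
    choice = [[0] * (total_gpus + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        table = throughput[j - 1]
        for c in range(total_gpus + 1):
            # try every GPU count g that job j-1 supports and that fits in c
            for g in range(min(c, len(table) - 1) + 1):
                value = best[j - 1][c - g] + table[g]
                if value > best[j][c]:
                    best[j][c] = value
                    choice[j][c] = g
    # walk the choice table backwards to recover the allocation
    alloc = [0] * n
    c = total_gpus
    for j in range(n, 0, -1):
        alloc[j - 1] = choice[j][c]
        c -= alloc[j - 1]
    return best[n][total_gpus], alloc


# Two jobs with diminishing returns, four GPUs in the pool.
total, alloc = allocate_gpus([[0, 10, 18, 24], [0, 8, 15, 16]], 4)
```

The table-driven form runs in time proportional to (jobs × pool size × max GPUs per job), which is one reason a dynamic program is a natural fit for re-solving the allocation whenever jobs arrive or finish.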

preprint, 2020, arXiv

On Optimizing Distributed Tucker Decomposition for Sparse Tensors

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input tensor among the processors (MPI ranks) critically influences the HOOI execution time. Prior work has proposed different distribution schemes: an offline scheme based on a sophisticated hypergraph partitioning method, and simple, lightweight alternatives that can be used in real time. While the hypergraph based scheme typically results in faster HOOI execution time, it is complex, and the time taken to determine the distribution is an order of magnitude higher than the execution time of a single HOOI iteration. Our main contribution is a lightweight distribution scheme, which achieves the best of both worlds. We show that the scheme is near-optimal on certain fundamental metrics associated with the HOOI procedure and, as a result, near-optimal on the computational load (FLOPs). Though the scheme may incur higher communication volume, the computation time is the dominant factor and, as a result, the scheme achieves better performance on the overall HOOI execution time. Our experimental evaluation on large real-life tensors (having up to 4 billion elements) shows that the scheme outperforms the prior schemes on the HOOI execution time by a factor of up to 3x. On the other hand, its distribution time is comparable to the prior lightweight schemes and is typically less than the execution time of a single HOOI iteration.

preprint, 2020, arXiv

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

We develop a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy. It works by: a) exploiting redundancy pertaining to word-vectors (intermediate encoder outputs) and eliminating the redundant vectors; b) determining which word-vectors to eliminate by developing a strategy for measuring their significance, based on the self-attention mechanism; and c) learning how many word-vectors to eliminate by augmenting the BERT model and the loss function. Experiments on the standard GLUE benchmark show that PoWER-BERT achieves up to 4.5x reduction in inference time over BERT with <1% loss in accuracy. We show that PoWER-BERT offers a significantly better trade-off between accuracy and inference time compared to prior methods. We demonstrate that our method attains up to 6.8x reduction in inference time with <1% loss in accuracy when applied over ALBERT, a highly compressed version of BERT. The code for PoWER-BERT is publicly available at https://github.com/IBM/PoWER-BERT.
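Step b) can be illustrated with a small sketch. The abstract only says that significance is derived from self-attention; the scoring rule below (total attention a position receives, summed over heads and query positions) is an assumption made for illustration, as are the function names and the fixed `k`. In the method itself, the number of vectors to keep is learned (step c).

```python
def significance_scores(attention):
    """Score each position by the attention it receives.

    attention[h][i][j] is the weight that head h assigns from query
    position i to key position j (an assumed layout).
    """
    n = len(attention[0])
    scores = [0.0] * n
    for head in attention:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return scores


def keep_top_k(vectors, scores, k):
    """Keep the k highest-scoring word-vectors in their original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:k])
    return [vectors[j] for j in kept], kept


# One head, three positions; position 1 receives the least attention.
attention = [[[0.7, 0.2, 0.1],
              [0.6, 0.1, 0.3],
              [0.5, 0.1, 0.4]]]
scores = significance_scores(attention)
vectors, kept = keep_top_k(["v0", "v1", "v2"], scores, k=2)
```

Dropping vectors between encoder layers shrinks the sequence that every later layer has to process, which is where the inference-time savings would come from.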

preprint, 2016, arXiv

Improvable Knapsack Problems

We consider a variant of the knapsack problem, where items are available with different possible weights. Using a separate budget for these item improvements, the question is: Which items should be improved to which degree such that the resulting classic knapsack problem yields maximum profit? We present a detailed analysis for several cases of improvable knapsack problems, presenting constant factor approximation algorithms and two PTAS.

preprint, 2016, arXiv

Subgraph Counting: Color Coding Beyond Trees

The problem of counting occurrences of query graphs in a large data graph, known as subgraph counting, is fundamental to several domains such as genomics and social network analysis. Many important special cases (e.g. triangle counting) have received significant attention. Color coding is a very general and powerful algorithmic technique for subgraph counting. Color coding has been shown to be effective in several applications, but scalable implementations are only known for the special case of tree queries (i.e. queries of treewidth one). In this paper we present the first efficient distributed implementation for color coding that goes beyond tree queries: our algorithm applies to any query graph of treewidth 2. Since tree queries can be solved in time linear in the size of the data graph, our contribution is the first step into the realm of color coding for queries that require superlinear running time in the worst case. This superlinear complexity leads to significant load balancing problems on graphs with heavy tailed degree distributions. Our algorithm structures the computation to work around high degree nodes in the data graph, and achieves very good runtime and scalability on a diverse collection of data and query graph pairs as a result. We also provide theoretical analysis of our algorithmic techniques, showing asymptotic improvements in runtime on random graphs with power law degree distributions, a popular model for real world graphs.
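For readers unfamiliar with color coding, the sketch below shows the classic single-machine version for the simplest tree query, a path on `k` vertices. It is not the paper's treewidth-2 algorithm; the function name and the dictionary-based graph representation are illustrative assumptions. Given a fixed coloring, it counts "colorful" paths (all vertex colors distinct) with a dynamic program over color subsets.

```python
def count_colorful_paths(adj, colors, k):
    """Count paths on k >= 2 vertices whose vertex colors are all distinct.

    adj: dict mapping each vertex to a list of neighbours (undirected).
    colors: dict mapping each vertex to a color in range(k).
    Each undirected path is counted once.
    """
    # table[v][mask]: number of colorful paths ending at v that use
    # exactly the colors in the bitmask `mask`
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        extended = {v: {} for v in adj}
        for u in adj:
            for mask, count in table[u].items():
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if mask & bit:
                        continue  # color already used on this path
                    new_mask = mask | bit
                    extended[v][new_mask] = extended[v].get(new_mask, 0) + count
        table = extended
    directed = sum(c for per_vertex in table.values() for c in per_vertex.values())
    return directed // 2  # every path was found once from each endpoint


# Path graph a-b-c-d with three colors.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
colors = {"a": 0, "b": 1, "c": 2, "d": 0}
n_paths = count_colorful_paths(adj, colors, 3)
```

In the usual randomized scheme, colors are drawn uniformly at random, a fixed set of `k` vertices is colorful with probability k!/k^k, and the colorful count is scaled by k^k/k! and averaged over repetitions to estimate the true count.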

preprint, 2013, arXiv

Distributed and Parallel Algorithms for Set Cover Problems with Small Neighborhood Covers

In this paper, we study a class of set cover problems that satisfy a special property which we call the small neighborhood cover property. This class encompasses several well-studied problems including vertex cover, interval cover, bag interval cover and tree cover. We design unified distributed and parallel algorithms that can handle any set cover problem falling under the above framework and yield constant factor approximations. These algorithms run in polylogarithmic communication rounds in the distributed setting and are in NC in the parallel setting.

preprint, 2012, arXiv

Density Functions subject to a Co-Matroid Constraint

In this paper we consider the problem of finding the densest subset subject to co-matroid constraints. We are given a monotone supermodular set function $f$ defined over a universe $U$, and the density of a subset $S$ is defined to be $f(S)/|S|$. This generalizes the concept of graph density. Co-matroid constraints are the following: given a matroid $\mathcal{M}$, a set $S$ is feasible if and only if the complement of $S$ is independent in the matroid. Under such constraints, the problem becomes NP-hard. The specific case of graph density has been considered in the literature under specific co-matroid constraints, for example, the cardinality matroid and the partition matroid. We show a 2-approximation for finding the densest subset subject to co-matroid constraints. Thus, for instance, we improve the approximation guarantees for the result for partition matroids in the literature.
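To make the definitions concrete, here is a brute-force sketch for the graph-density special case, where f(S) is the number of edges induced by S. It assumes the cardinality (uniform) matroid, under which "the complement of S is independent" amounts to a lower bound on |S|; the function name and the `min_size` parameter are illustrative. It enumerates all subsets, so it is exponential and is not the 2-approximation from the abstract.

```python
from itertools import combinations


def densest_subset(vertices, edges, min_size):
    """Exhaustively find the subset S with |S| >= min_size maximizing
    (number of edges inside S) / |S|.

    min_size models a cardinality co-matroid constraint: the complement
    of S may contain at most len(vertices) - min_size elements.
    """
    best_set, best_density = None, -1.0
    for size in range(max(1, min_size), len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            induced = sum(1 for u, v in edges if u in chosen and v in chosen)
            density = induced / size
            if density > best_density:
                best_set, best_density = chosen, density
    return best_set, best_density


# A 4-clique on {1, 2, 3, 4} with a tail 4-5-6 attached.
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
unconstrained = densest_subset(vertices, edges, min_size=1)
constrained = densest_subset(vertices, edges, min_size=5)
```

Without the constraint the clique wins with density 6/4; forcing at least five vertices pushes the optimum to the clique plus vertex 5, with density 7/5, which shows how the constraint changes the answer.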

preprint, 2012, arXiv

Distributed Algorithms for Scheduling on Line and Tree Networks

We have a set of processors (or agents) and a set of graph networks defined over some vertex set. Each processor can access a subset of the graph networks. Each processor has a demand specified as a pair of vertices $\langle u, v \rangle$, along with a profit; the processor wishes to send data between $u$ and $v$. Towards that goal, the processor needs to select a graph network accessible to it and a path connecting $u$ and $v$ within the selected network. The processor requires exclusive access to the chosen path in order to route the data. Thus, the processors are competing for routes/channels. A feasible solution selects a subset of demands and schedules each selected demand on a graph network accessible to the processor owning the demand; the solution also specifies the paths to use for this purpose. The requirement is that for any two demands scheduled on the same graph network, their chosen paths must be edge disjoint. The goal is to output a solution having the maximum aggregate profit. Prior work has addressed the above problem in a distributed setting for the special case where all the graph networks are simply paths (i.e., line-networks). Distributed constant factor approximation algorithms are known for this case. The main contributions of this paper are twofold. First, we design a distributed constant factor approximation algorithm for the more general case of tree-networks. The core component of our algorithm is a tree-decomposition technique, which may be of independent interest. Second, for the case of line-networks, we improve the known approximation guarantees by a factor of 5. Our algorithms can also handle the capacitated scenario, wherein the demands and edges have bandwidth requirements and capacities, respectively.
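The feasibility condition can be sketched for the line-network case, where the path between two vertices is unique. The snippet assumes vertices are integer positions along the line, that each demand has distinct endpoints, and that a schedule is a list of hypothetical `(network_id, u, v)` triples. It checks only edge-disjointness; it does not model which networks a processor may access, profits, or capacities.

```python
def is_edge_disjoint(schedule):
    """Return True if no two demands on the same line-network share an edge.

    schedule: list of (network_id, u, v) with u != v given as integer
    positions on the line. On a line, the path from u to v is the
    span [min(u, v), max(u, v)].
    """
    spans_by_network = {}
    for network, u, v in schedule:
        spans_by_network.setdefault(network, []).append((min(u, v), max(u, v)))
    for spans in spans_by_network.values():
        spans.sort()
        # after sorting by start, any overlap shows up between neighbours
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:
                return False  # the two spans share at least one edge
    return True


ok = is_edge_disjoint([(0, 1, 4), (0, 7, 4), (1, 2, 6)])
clash = is_edge_disjoint([(0, 1, 4), (0, 3, 7)])
```

Two demands that merely touch at a vertex (one ends at position 4, the other starts there) remain feasible, since the requirement is on shared edges rather than shared vertices.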

preprint, 2012, arXiv

Mapping Strategies for the PERCS Architecture

The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity high-performance computing system. The IBM PERCS architecture is a two-level direct network having low diameter and high bisection bandwidth. Mapping and routing strategies play an important role in the performance of applications on such a topology. In this paper, we study mapping strategies for the PERCS architecture, examining how to map the tasks of a given job onto the physical processing nodes. We develop and present fundamental principles for designing good mapping strategies that minimize congestion. This is achieved via a theoretical study of some common communication patterns under both direct and indirect routing mechanisms supported by the architecture.

preprint, 2012, arXiv

Scheduling Resources for Executing a Partial Set of Jobs

In this paper, we consider the problem of choosing a minimum cost set of resources for executing a specified set of jobs. Each input job is an interval, determined by its start-time and end-time. Each resource is also an interval determined by its start-time and end-time; moreover, every resource has a capacity and a cost associated with it. We consider two versions of this problem. In the partial covering version, we are also given as input a number k, specifying the number of jobs that must be performed. The goal is to choose k jobs and find a minimum cost set of resources to perform the chosen k jobs (at any point in time, the capacity of the chosen set of resources should be sufficient to execute the jobs active at that time). We present an O(log n)-factor approximation algorithm for this problem. We also consider the prize collecting version, wherein every job also has a penalty associated with it. A feasible solution consists of a subset of the jobs and a set of resources to perform the chosen subset of jobs. The goal is to find a feasible solution that minimizes the sum of the costs of the selected resources and the penalties of the jobs that are not selected. We present a constant factor approximation algorithm for this problem.

preprint, 2011, arXiv

The update complexity of selection and related problems

We present a framework for computing with input data specified by intervals, representing uncertainty in the values of the input parameters. To compute a solution, the algorithm can query the input parameters; each query yields a more refined estimate in the form of a sub-interval, and the objective is to minimize the number of queries. Previous approaches address the scenario where every query returns an exact value. Our framework is more general, as it can deal with a wider variety of inputs and query responses, and we establish interesting relationships between them that have not been investigated previously. Although some of the approaches of the previous restricted models can be adapted to the more general model, we require more sophisticated techniques for the analysis, and we also obtain improved algorithms for the previous model. We address selection problems in the generalized model and show that there exist 2-update competitive algorithms that do not depend on the lengths or distribution of the sub-intervals and hold against the worst case adversary. We also obtain similar bounds on the competitive ratio for the MST problem in graphs.

preprint, 2010, arXiv

On the Complexity of the $k$-Anonymization Problem

We study the problem of anonymizing tables containing personal information before releasing them for public use. One of the formulations considered in this context is the $k$-anonymization problem: given a table, suppress a minimum number of cells so that in the transformed table, each row is identical to at least $k-1$ other rows. The problem is known to be NP-hard and MAXSNP-hard; but in the known reductions, the number of columns in the constructed tables is arbitrarily large. However, in practical settings the number of columns is much smaller. So, we study the complexity of the practical setting in which the number of columns $m$ is small. We show that the problem is NP-hard, even when the number of columns $m$ is a constant ($m=3$). We also prove MAXSNP-hardness for this restricted version and derive that the problem cannot be approximated within a factor of $6238/6237$. Our reduction uses alphabets $Σ$ of arbitrarily large size. A natural question is whether the problem remains NP-hard when both $m$ and $|Σ|$ are small. We prove that the $k$-anonymization problem is in P when both $m$ and $|Σ|$ are constants.
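The problem statement can be made concrete with a small checker. This sketch only verifies a candidate solution (it does not search for a minimum set of cells); the function names and the use of `"*"` as the suppression symbol are illustrative assumptions.

```python
from collections import Counter


def suppress(table, cells, star="*"):
    """Return a copy of the table with the given (row, column) cells suppressed."""
    result = [list(row) for row in table]
    for r, c in cells:
        result[r][c] = star
    return result


def is_k_anonymous(table, k):
    """True if every row is identical to at least k - 1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(count >= k for count in counts.values())


table = [["a", "x"],
         ["a", "y"],
         ["b", "z"],
         ["b", "z"]]
before = is_k_anonymous(table, 2)
# Suppressing the second column of the first two rows makes them identical.
after = is_k_anonymous(suppress(table, [(0, 1), (1, 1)]), 2)
```

Here two suppressed cells suffice for 2-anonymity; the optimization problem asks for the smallest such set, which is the part shown to be NP-hard even for three columns.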
