Source author record

Chita R. Das

Chita R. Das appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

4works

1topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

CASH: A Credit Aware Scheduling for Public Cloud Platforms

The public cloud offers a myriad of services which allows its tenants to process large scale big data in a flexible, easy and cost effective manner. Tenants generally use large scale data processing frameworks such as MapReduce, Tez, Spark etc. to process their data. Tenants can configure their frameworks to run individual tasks by the framework itself or have a middleware cluster manager like YARN or Mesos to arbitrate resource scheduling in their public-cloud cluster. Cluster managers need to be cognizant about the workload requirement along with the state of the individual resource such as CPU and disk in the cluster. Cloud providers use a token bucket mechanism for their individual hardware resources as an indicator of the quality-of-service that individual hardware resource can provide. In this paper, through our changes in YARN, Hadoop and Tez, we show how middleware cluster managers can be made cognizant about the expected quality-of-service of individual hardware resources in the cluster. Our optimized cluster manager with a coarse grained knowledge of task requirement and fine grained knowledge of expected quality-of-service of hardware resources in the cluster performs highly optimal task placements. Our experiments with our optimizations show CPU credit based instances like the Amazon T3 instances as a viable cost effective option for running bigdata workloads. We also show that streaming SQL queries on a Hive warehouse can be accelerated by up to 31% leading to public cloud cost savings of up to 22%.

preprint2020arXiv

Fifer: Tackling Underutilization in the Serverless Era

Datacenters are witnessing a rapid surge in the adoption of serverless functions for microservices-based applications. A vast majority of these microservices typically span less than a second, have strict SLO requirements, and are chained together as per the requirements of an application. The aforementioned characteristics introduce a new set of challenges, especially in terms of container provisioning and management, as the state-of-the-art resource management frameworks, employed in serverless platforms, tend to look at microservice-based applications similar to conventional monolithic applications. Hence, these frameworks suffer from microservice-agnostic scheduling and colossal container over-provisioning, especially during workload fluctuations, thereby resulting in poor resource utilization. In this work, we quantify the above shortcomings using a variety of workloads on a multi-node cluster managed by Kubernetes and Brigade serverless framework. To address them, we propose \emph{Fifer} -- an adaptive resource management framework to efficiently manage function-chains on serverless platforms. The key idea is to make \emph{Fifer} (i) utilization conscious by efficiently bin packing jobs to fewer containers using function-aware container scaling and intelligent request batching, and (ii) at the same time, SLO-compliant by proactively spawning containers to avoid cold-starts, thus minimizing the overall response latency. Combining these benefits, \emph{Fifer} improves container utilization and cluster-wide energy consumption by 4x and 31%, respectively, without compromising on SLO's, when compared to the state-of-the-art schedulers employed by serverless platforms.

preprint2020arXiv

Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Traditionally, HPC workloads have been deployed in bare-metal clusters; but the advances in virtualization have led the pathway for these workloads to be deployed in virtualized clusters. However, HPC cluster administrators/providers still face challenges in terms of resource elasticity and virtual machine (VM) provisioning at large-scale, due to the lack of coordination between a traditional HPC scheduler and the VM hypervisor (resource management layer). This lack of interaction leads to low cluster utilization and job completion throughput. Furthermore, the VM provisioning delays directly impact the overall performance of jobs in the cluster. Hence, there is a need for effectively provisioning virtualized HPC clusters, which can best-utilize the physical hardware with minimal provisioning overheads. Towards this, we propose Multiverse, a VM provisioning framework, which can dynamically spawn VMs for incoming jobs in a virtualized HPC cluster, by integrating the HPC scheduler along with VM resource manager. We have implemented this framework on the Slurm} scheduler along with the vSphere VM resource manager. In order to reduce the VM provisioning overheads, we use instant cloning which shares both the disk and memory with the parent VM, when compared to full VM cloning which has to boot-up a new VM from scratch. Measurements with real-world HPC workloads demonstrate that, instant cloning is 2.5x faster than full cloning in terms of VM provisioning time. Further, it improves resource utilization by up to 40%, and cluster throughput by up to 1.5x, when compared to full clone for bursty job arrival scenarios.

preprint2020arXiv

Towards Designing a Self-Managed Machine Learning Inference Serving System inPublic Cloud

We are witnessing an increasing trend towardsusing Machine Learning (ML) based prediction systems, span-ning across different application domains, including productrecommendation systems, personal assistant devices, facialrecognition, etc. These applications typically have diverserequirements in terms of accuracy and response latency, thathave a direct impact on the cost of deploying them in a publiccloud. Furthermore, the deployment cost also depends on thetype of resources being procured, which by themselves areheterogeneous in terms of provisioning latencies and billingcomplexity. Thus, it is strenuous for an inference servingsystem to choose from this confounding array of resourcetypes and model types to provide low-latency and cost-effectiveinferences. In this work we quantitatively characterize the cost,accuracy and latency implications of hosting ML inferenceson different public cloud resource offerings. In addition, wecomprehensively evaluate prior work which tries to achievecost-effective prediction-serving. Our evaluation shows that,prior work does not solve the problem from both dimensionsof model and resource heterogeneity. Hence, we argue that toaddress this problem, we need to holistically solve the issuesthat arise when trying to combine both model and resourceheterogeneity towards optimizing for application constraints.Towards this, we envision developing a self-managed inferenceserving system, which can optimize the application require-ments based on public cloud resource characteristics. In orderto solve this complex optimization problem, we explore the highlevel design of a reinforcement-learning based system that canefficiently adapt to the changing needs of the system at scale.

Chita R. Das

What is connected

Connect this record

See the researcher in context

Building this map preview

4 published item(s)

CASH: A Credit Aware Scheduling for Public Cloud Platforms

Fifer: Tackling Underutilization in the Serverless Era

Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Towards Designing a Self-Managed Machine Learning Inference Serving System inPublic Cloud