Source author record

Vijay Chidambaram

Vijay Chidambaram appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning Operating Systems Cryptography and Security cs.CY Databases eess.SY Software Engineering Systems and Control

Catalog footprint

What is connected

5works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Finding and Analyzing Crash-Consistency Bugs in Persistent-Memory File Systems

We present a study of crash-consistency bugs in persistent-memory (PM) file systems and analyze their implications for file-system design and testing crash consistency. We develop FlyTrap, a framework to test PM file systems for crash-consistency bugs. FlyTrap discovered 18 new bugs across four PM file systems; the bugs have been confirmed by developers and many have been already fixed. The discovered bugs have serious consequences such as breaking the atomicity of rename or making the file system unmountable. We present a detailed study of the bugs we found and discuss some important lessons from these observations. For instance, one of our findings is that many of the bugs are due to logic errors, rather than errors in using flushes or fences; this has important applications for future work on testing PM file systems. Another key finding is that many bugs arise from attempts to improve efficiency by performing metadata updates in-place and that recovery code that deals with rebuilding in-DRAM state is a significant source of bugs. These observations have important implications for designing and testing PM file systems. Our code is available at https://github.com/utsaslab/flytrap .

preprint2022arXiv

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory proportional to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation and some jobs might not be affected by less than GPU-proportional allocation. Synergy performs such multi-resource workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average JCT up to 3.4x when compared to traditional GPU-proportional scheduling.

preprint2021arXiv

Analyzing and Mitigating Data Stalls in DNN Training

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), that typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation etc on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).

preprint2020arXiv

Towards Software-Defined Data Protection: GDPR Compliance at the Storage Layer is Within Reach

Enforcing data protection and privacy rules within large data processing applications is becoming increasingly important, especially in the light of GDPR and similar regulatory frameworks. Most modern data processing happens on top of a distributed storage layer, and securing this layer against accidental or malicious misuse is crucial to ensuring global privacy guarantees. However, the performance overhead and the additional complexity for this is often assumed to be significant -- in this work we describe a path forward that tackles both challenges. We propose "Software-Defined Data Protection" (SDP), an adoption of the "Software-Defined Storage" approach to non-performance aspects: a trusted controller translates company and application-specific policies to a set of rules deployed on the storage nodes. These, in turn, apply the rules at line-rate but do not take any decisions on their own. Such an approach decouples often changing policies from request-level enforcement and allows storage nodes to implement the latter more efficiently. Even though in-storage processing brings challenges, mainly because it can jeopardize line-rate processing, we argue that today's Smart Storage solutions can already implement the required functionality, thanks to the separation of concerns introduced by SDP. We highlight the challenges that remain, especially that of trusting the storage nodes. These need to be tackled before we can reach widespread adoption in cloud environments.

preprint2020arXiv

Understanding and Benchmarking the Impact of GDPR on Database Systems

The General Data Protection Regulation (GDPR) provides new rights and protections to European people concerning their personal data. We analyze GDPR from a systems perspective, translating its legal articles into a set of capabilities and characteristics that compliant systems must support. Our analysis reveals the phenomenon of metadata explosion, wherein large quantities of metadata needs to be stored along with the personal data to satisfy the GDPR requirements. Our analysis also helps us identify new workloads that must be supported under GDPR. We design and implement an open-source benchmark called GDPRbench that consists of workloads and metrics needed to understand and assess personal-data processing database systems. To gauge the readiness of modern database systems for GDPR, we follow best practices and developer recommendations to modify Redis, PostgreSQL, and a commercial database system to be GDPR compliant. Our experiments demonstrate that the resulting GDPR compliant systems achieve poor performance on GPDR workloads, and that performance scales poorly as the volume of personal data increases. We discuss the real-world implications of these findings, and identify research challenges towards making GDPR compliance efficient in production environments. We release all of our software artifacts and datasets at http://www.gdprbench.org

Vijay Chidambaram

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

Finding and Analyzing Crash-Consistency Bugs in Persistent-Memory File Systems

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Analyzing and Mitigating Data Stalls in DNN Training

Towards Software-Defined Data Protection: GDPR Compliance at the Storage Layer is Within Reach

Understanding and Benchmarking the Impact of GDPR on Database Systems