Source author record

Mahmut T. Kandemir

Mahmut T. Kandemir appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning

Catalog footprint

What is connected

3works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

GraphGuess: Approximate Graph Processing System with Adaptive Correction

Graph-based data structures have drawn great attention in recent years. The large and rapidly growing trend on developing graph processing systems focuses mostly on improving the performance by preprocessing the input graph and modifying its layout. These systems usually take several hours to days to complete processing a single graph on high-end machines, let alone the overhead of pre-processing which most of the time can be dominant. Yet for most graph applications the exact answer is not always crucial, and providing a rough estimate of the final result is adequate. Approximate computing is introduced to trade off accuracy of results for computation or energy savings that could not be achieved by conventional techniques alone. In this work, we design, implement and evaluate GraphGuess, inspired from the domain of approximate graph theory and extend it to a general, practical graph processing system. GraphGuess is essentially an approximate graph processing technique with adaptive correction, which can be implemented on top of any graph processing system. We build a vertex-centric processing system based on GraphGuess, where it allows the user to trade off accuracy for better performance. Our experimental studies show that using GraphGuess can significantly reduce the processing time for large scale graphs while maintaining high accuracy.

preprint2022arXiv

Learn Locally, Correct Globally: A Distributed Algorithm for Training Graph Neural Networks

Despite the recent success of Graph Neural Networks (GNNs), training GNNs on large graphs remains challenging. The limited resource capacities of the existing servers, the dependency between nodes in a graph, and the privacy concern due to the centralized storage and model learning have spurred the need to design an effective distributed algorithm for GNN training. However, existing distributed GNN training methods impose either excessive communication costs or large memory overheads that hinders their scalability. To overcome these issues, we propose a communication-efficient distributed GNN training technique named $\text{Learn Locally, Correct Globally}$ (LLCG). To reduce the communication and memory overhead, each local machine in LLCG first trains a GNN on its local data by ignoring the dependency between nodes among different machines, then sends the locally trained model to the server for periodic model averaging. However, ignoring node dependency could result in significant performance degradation. To solve the performance degradation, we propose to apply $\text{Global Server Corrections}$ on the server to refine the locally learned models. We rigorously analyze the convergence of distributed methods with periodic model averaging for training GNNs and show that naively applying periodic model averaging but ignoring the dependency between nodes will suffer from an irreducible residual error. However, this residual error can be eliminated by utilizing the proposed global corrections to entail fast convergence rate. Extensive experiments on real-world datasets show that LLCG can significantly improve the efficiency without hurting the performance.

preprint2020arXiv

Fifer: Tackling Underutilization in the Serverless Era

Datacenters are witnessing a rapid surge in the adoption of serverless functions for microservices-based applications. A vast majority of these microservices typically span less than a second, have strict SLO requirements, and are chained together as per the requirements of an application. The aforementioned characteristics introduce a new set of challenges, especially in terms of container provisioning and management, as the state-of-the-art resource management frameworks, employed in serverless platforms, tend to look at microservice-based applications similar to conventional monolithic applications. Hence, these frameworks suffer from microservice-agnostic scheduling and colossal container over-provisioning, especially during workload fluctuations, thereby resulting in poor resource utilization. In this work, we quantify the above shortcomings using a variety of workloads on a multi-node cluster managed by Kubernetes and Brigade serverless framework. To address them, we propose \emph{Fifer} -- an adaptive resource management framework to efficiently manage function-chains on serverless platforms. The key idea is to make \emph{Fifer} (i) utilization conscious by efficiently bin packing jobs to fewer containers using function-aware container scaling and intelligent request batching, and (ii) at the same time, SLO-compliant by proactively spawning containers to avoid cold-starts, thus minimizing the overall response latency. Combining these benefits, \emph{Fifer} improves container utilization and cluster-wide energy consumption by 4x and 31%, respectively, without compromising on SLO's, when compared to the state-of-the-art schedulers employed by serverless platforms.