Source author record

Alper Buyuktosunoglu

Alper Buyuktosunoglu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Distributed, Parallel, and Cluster Computing; Hardware Architecture; eess.SY; Machine Learning; Operating Systems; Performance; Systems and Control

Catalog footprint

What is connected

4 works

7 topics

4 close collaborators

Preprint, 2022, arXiv

HetSched: Quality-of-Mission Aware Scheduling for Autonomous Vehicle SoCs

Systems-on-Chips (SoCs) that power autonomous vehicles (AVs) must meet stringent performance and safety requirements prior to deployment. With increasing complexity in AV applications, the system must meet the real-time demands of multiple safety-critical applications simultaneously. A typical AV-SoC is a heterogeneous multiprocessor consisting of accelerators supported by general-purpose cores. Such heterogeneity, while needed for power-performance efficiency, complicates task scheduling. In this paper, we demonstrate that hardware heterogeneity impacts the scheduler's effectiveness and that optimizing for only the real-time aspect of applications is not sufficient in AVs. A more holistic approach is therefore required, one that considers global Quality-of-Mission (QoM) metrics, as defined in the paper. We then propose HetSched, a multi-step scheduler that leverages dynamic runtime information about the underlying heterogeneous hardware platform, along with the applications' real-time constraints and the task traffic in the system, to optimize overall mission performance. HetSched provides two scheduling policies, MSstat and MSdyn, and scheduling optimizations such as task pruning, hybrid heterogeneous ranking, and rank update. HetSched improves overall mission performance on average by 4.6x, 2.6x and 2.6x when compared against CPATH, ADS and 2lvl-EDF (state-of-the-art real-time schedulers built for heterogeneous systems), respectively, and achieves an average of 53.3% higher hardware utilization, while meeting 100% of critical deadlines for real-world autonomous vehicle applications. Furthermore, when used as part of an SoC design space exploration loop, HetSched reduces the number of processing elements an SoC requires to safely complete AV missions by 35% on average compared to prior schedulers, while achieving a 2.7x lower energy-mission time product.

Preprint, 2020, arXiv

Improving Efficiency in Large-Scale Decentralized Distributed Training

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant, Asynchronous Decentralized Parallel SGD (AD-PSGD), form a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases as the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate (A)D-PSGD-based training by improving the spectral gap while minimizing the communication cost. We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. On an IBM P9 supercomputer, our system is able to train an LSTM acoustic model in 2.28 hours with 7.5% WER on the Hub5-2000 Switchboard (SWB) test set and 13.3% WER on the CallHome (CH) test set using 64 V100 GPUs, and in 1.98 hours with 7.7% WER on SWB and 13.3% WER on CH using 128 V100 GPUs, the fastest training time reported to date.
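
The shrinking spectral gap can be illustrated with a small sketch. It assumes a ring topology in which each learner averages itself and its two neighbours with weight 1/3, and it takes the spectral gap to be one minus the second-largest eigenvalue magnitude of the mixing matrix. Both choices, and the function name `ring_spectral_gap`, are illustrative assumptions; the abstract does not state the topology or mixing matrix the paper uses.

```python
import math

def ring_spectral_gap(n: int) -> float:
    """Spectral gap of an assumed ring mixing matrix with n learners.

    The matrix is circulant with weight 1/3 on the diagonal and on each of
    the two ring neighbours, so its eigenvalues have the closed form
    lambda_k = 1/3 + (2/3) * cos(2*pi*k/n) for k = 0..n-1.
    lambda_0 = 1 is the consensus eigenvalue and is excluded.
    """
    if n < 3:
        raise ValueError("a ring with distinct left/right neighbours needs n >= 3")
    others = [1 / 3 + (2 / 3) * math.cos(2 * math.pi * k / n) for k in range(1, n)]
    return 1.0 - max(abs(lam) for lam in others)

if __name__ == "__main__":
    # The gap shrinks toward zero as learners are added to the ring.
    for n in (4, 8, 16, 32, 64):
        print(n, round(ring_spectral_gap(n), 4))
```

A smaller gap means information mixes more slowly across learners, which is the convergence problem the abstract refers to.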

Preprint, 2020, arXiv

STOMP: A Tool for Evaluation of Scheduling Policies in Heterogeneous Multi-Processors

The proliferation of heterogeneous chip multiprocessors in recent years has reached unprecedented levels. Traditional homogeneous platforms have shown fundamental limitations when it comes to enabling high-performance yet ultra-low-power computing, in particular in application domains with real-time execution deadlines or criticality constraints. By combining the right set of general-purpose cores and hardware accelerators with proper chip interconnects and memory technology, heterogeneous chip multiprocessors have become an effective high-performance and low-power computing alternative. One of the challenges of heterogeneous architectures is the efficient scheduling of application tasks (processes, threads) across the variety of options in the chip. As a result, it is key to provide tools that enable early-stage prototyping and evaluation of new scheduling policies for heterogeneous platforms. In this paper, we present STOMP (Scheduling Techniques Optimization in heterogeneous Multi-Processors), a simulator for fast implementation and evaluation of task scheduling policies in multi-core/multi-processor systems, with a convenient interface for "plugging" in new scheduling policies in a simple manner. Thorough validation of STOMP exhibits small relative errors when compared against equivalent closed-form models during steady-state analysis.
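
The idea of "plugging" scheduling policies into a simulator can be sketched as follows. This is not STOMP's interface: the `Server`, `Task`, `first_free`, `fastest_finish`, and `simulate` names are hypothetical, and the model (tasks served in arrival order, one task per server at a time) is a deliberate simplification chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Server:
    kind: str              # hypothetical label, e.g. "cpu" or "accel"
    free_at: float = 0.0   # time at which the server next becomes idle

@dataclass
class Task:
    arrival: float
    service: Dict[str, float]  # service time per supported server kind

# A policy is any callable that picks a server for a task (or None to wait).
Policy = Callable[[Task, List[Server], float], Optional[Server]]

def first_free(task: Task, servers: List[Server], now: float) -> Optional[Server]:
    """Pick the first idle server that supports the task."""
    for s in servers:
        if s.kind in task.service and s.free_at <= now:
            return s
    return None

def fastest_finish(task: Task, servers: List[Server], now: float) -> Optional[Server]:
    """Pick the supported server with the earliest completion time, queueing if busy."""
    usable = [s for s in servers if s.kind in task.service]
    return min(usable, key=lambda s: max(s.free_at, now) + task.service[s.kind],
               default=None)

def simulate(tasks: List[Task], servers: List[Server], policy: Policy) -> float:
    """Serve tasks in arrival order under `policy`; return the mean response time."""
    total = 0.0
    for t in sorted(tasks, key=lambda t: t.arrival):
        now = t.arrival
        chosen = policy(t, servers, now)
        while chosen is None:
            # Nothing chosen yet: advance to when a supporting server frees up.
            now = min(s.free_at for s in servers if s.kind in t.service)
            chosen = policy(t, servers, now)
        start = max(now, chosen.free_at)
        chosen.free_at = start + t.service[chosen.kind]
        total += chosen.free_at - t.arrival
    return total / len(tasks)

def demo(policy: Policy) -> float:
    """Two identical tasks arriving at t=0 on one CPU and one accelerator."""
    servers = [Server("cpu"), Server("accel")]
    tasks = [Task(0.0, {"cpu": 10.0, "accel": 2.0}) for _ in range(2)]
    return simulate(tasks, servers, policy)
```

Swapping the policy changes only the argument to `simulate`: `demo(first_free)` sends one task to the slow CPU, while `demo(fastest_finish)` queues both on the accelerator and yields a lower mean response time.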

Preprint, 2019, arXiv

Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads

Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads. This paper proposes Touché, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touché framework consists of three components. The first component, called the "Signature" (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the "Tag Appended Data" (TADA) mechanism, stores the full tag addresses with data. TADA enables Touché to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the "Superblock Marker" (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is completely hardware-based and achieves an average speedup of 12% (ideal 13%) when compared to an uncompressed baseline.
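
The interplay between shortened signatures and tags stored with the data can be sketched in software. This is a minimal illustration, not the paper's hardware design: the 9-bit signature width, the XOR-folding hash, and the `CompressedLine` class are all assumptions made for the example, since the abstract does not specify how signatures are computed or stored.

```python
SIG_BITS = 9  # hypothetical signature width; the abstract does not give one

def signature(tag: int, bits: int = SIG_BITS) -> int:
    """XOR-fold a tag address down to `bits` bits (an assumed, illustrative hash)."""
    mask = (1 << bits) - 1
    sig = 0
    while tag:
        sig ^= tag & mask
        tag >>= bits
    return sig

class CompressedLine:
    """One physical cacheline holding several compressed blocks."""

    def __init__(self) -> None:
        self.signatures = set()   # short signatures kept in the tag entry
        self.blocks = {}          # full tag -> data, kept alongside the data

    def insert(self, tag: int, data: bytes) -> None:
        self.signatures.add(signature(tag))
        self.blocks[tag] = data

    def lookup(self, tag: int):
        """Return (status, data) for an access to `tag`."""
        if signature(tag) not in self.signatures:
            return ("miss_no_data_access", None)   # data array is never read
        if tag in self.blocks:
            return ("hit", self.blocks[tag])       # full tag confirms the match
        return ("false_positive", None)            # signature matched, full tag did not

line = CompressedLine()
line.insert(0x001, b"block A")
# 0x200 folds to the same 9-bit signature as 0x001, so it aliases in the tag entry.
```

Here `line.lookup(0x200)` passes the signature check but fails the full-tag comparison, which is the false-positive case the stored full tags are there to catch.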