Researcher profile

Akihiro Hayashi

Akihiro Hayashi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 13 - UnverifiedVerification L1Unclaimed author
2works
0followers
1topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

2 published item(s)

preprint2022arXiv

A Scalable Actor-based Programming System for PGAS Runtimes

The PGAS model is well suited for executing irregular applications on cluster-based systems, due to its efficient support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communications at a fine-grained granularity that is natural to the application, experiments have shown that such a natural expression results in performance that is 20x slower than more efficient but less productive code that requires manual message aggregation and termination detection. In this paper, we introduce a new programming system for PGAS applications, in which point-to-point remote operations can be expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about programming complexities related to message aggregation and termination detection. Our approach can also be viewed as extending the classical Bulk Synchronous Parallelism model with fine-grained asynchronous communications within a phase or superstep. We believe that our approach offers a desirable point in the productivity-performance space for PGAS applications, with more scalable performance and higher productivity relative to past approaches. Specifically, for seven irregular mini-applications from the Bale benchmark suite executed using 2048 cores in the NERSC Cori system, our approach shows geometric mean performance improvements of >=20x relative to standard PGAS versions (UPC and OpenSHMEM) while maintaining comparable productivity to those versions.

preprint2022arXiv

Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing

This paper introduces a novel approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms. Our approach enables AOT source-to-source transformation of Python programs, driven by the inclusion of type hints for function parameters and return values. These hints can be supplied by the programmer or obtained by dynamic profiler tools; multi-version code generation guarantees the correctness of our AOT transformation in all cases. Our compilation framework performs automatic parallelization and sophisticated high-level code optimizations for the target distributed heterogeneous hardware platform. It includes extensions to the polyhedral framework that unify user-written loops and implicit loops present in matrix/tensor operators, as well as automated section of CPU vs. GPU code variants. Further, our polyhedral optimizations enable both intra-node and inter-node parallelism. Finally, the optimized output code is deployed using the Ray runtime for scheduling distributed tasks across multiple heterogeneous nodes in a cluster. Our empirical evaluation shows significant performance improvements relative to sequential Python in both single-node and multi-node experiments, with a performance improvement of over 20,000$\times$ when using 24 nodes and 144 GPUs in the OLCF Summit supercomputer for the Space-Time Adaptive Processing (STAP) radar application.