Researcher profile

Hamid Reza Zohouri

Hamid Reza Zohouri contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2020arXiv

AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.

preprint2020arXiv

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

In this paper we evaluate the performance of FPGAs for high-order stencil computation using High-Level Synthesis. We show that despite the higher computation intensity and on-chip memory requirement of such stencils compared to first-order ones, our design technique with combined spatial and temporal blocking remains effective. This allows us to reach similar, or even higher, compute performance compared to first-order stencils. We use an OpenCL-based design that, apart from parameterizing performance knobs, also parameterizes the stencil radius. Furthermore, we show that our performance model exhibits the same accuracy as first-order stencils in predicting the performance of high-order ones. On an Intel Arria 10 GX 1150 device, for 2D and 3D star-shaped stencils, we achieve over 700 and 270 GFLOP/s of compute performance, respectively, up to a stencil radius of four. These results outperform the state-of-the-art YASK framework on a modern Xeon for 2D and 3D stencils, and outperform a modern Xeon Phi for 2D stencils, while achieving competitive performance in 3D. Furthermore, our FPGA design achieves better power efficiency in almost all cases.

preprint2020arXiv

The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's ability in maximizing memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.