Source author record

Tomofumi Yuki

Tomofumi Yuki appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Hardware Architecture Programming Languages

Catalog footprint

What is connected

2works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout

Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the accelerator's effective performance. Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerators I/O-bound. A further increase in effective bandwidth is possible by using burst rather than element-wise accesses, provided the data is contiguous in memory. In this paper, we propose a memory allocation technique, and provide a proof-of-concept source-to-source compiler pass, that enables such burst transfers by modifying the data layout in external memory. We assess how this technique pushes up the memory throughput, leaving room for exploiting additional parallelism, for a minimal logic overhead.

preprint2013arXiv

Checking Race Freedom of Clocked X10 Programs

One of many approaches to better take advantage of parallelism, which has now become mainstream, is the introduction of parallel programming languages. However, parallelism is by nature non-deterministic, and not all parallel bugs can be avoided by language design. This paper proposes a method for guaranteeing absence of data races in the polyhedral subset of clocked X10 programs. Clocks in X10 are similar to barriers, but are more dynamic; the subset of processes that participate in the synchronization can dynamically change at runtime. We construct the happens-before relation for clocked X10 programs, and show that the problem of race detection is undecidable. However, in many practical cases, modern tools are able to find solutions or disprove their existence. We present a set of benchmarks for which the analysis is possible and has an acceptable running time.