Source author record

Samuel Xavier-de-Souza

Samuel Xavier-de-Souza appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected

3 works

1 topic

4 close collaborators

Preprint, 2020, arXiv

An OpenMP translator for the GAP8 MPSoC

One of the barriers to the adoption of parallel computing is the inherent complexity of its programming. The Open Multi-Processing (OpenMP) Application Programming Interface (API) facilitates such implementations by providing high-abstraction-level directives. On another front, new architectures aimed at low energy consumption have been developed, such as the Greenwaves Technologies GAP8, a Multi-Processor System-on-Chip (MPSoC) based on the Parallel Ultra Low Power (PULP) Platform. The GAP8 has an 8-core cluster and a Fabric Controller (FC) master core. Parallel programming with the GAP8 is very promising in terms of efficiency, but its recent development and the lack of a robust OS to handle threads and core scheduling complicate a simple implementation of the OpenMP APIs. This project implements a source-to-source translator that interprets a limited set of OpenMP directives and generates parallel microcontroller code that manipulates the cores directly. The preliminary results obtained in this work show a reduction in code size compared with the base implementation, demonstrating that the project eases the programming of the GAP8. Further work is needed to implement more OpenMP directives.

Preprint, 2020, arXiv

Auto-tuning of dynamic scheduling applied to 3D reverse time migration on multicore systems

Reverse time migration (RTM) is an algorithm widely used in the oil and gas industry to process seismic data. It is a computationally intensive task that is well suited to parallel computers. Methods such as RTM can be parallelized in shared-memory systems by scheduling the iterations of parallel loops to threads. However, several aspects, such as memory size and hierarchy, number of cores, and input size, make optimal scheduling very challenging. In this paper, we introduce a run-time strategy to automatically tune the dynamic scheduling of parallel loop iterations in iterative applications, such as RTM, on multicore systems. The proposed method aims to reduce the execution time of such applications. To find the optimal granularity, we propose a coupled simulated annealing (CSA) based auto-tuning strategy that adjusts the chunk size of work that OpenMP parallel loops assign dynamically to worker threads during the initialization of a 3D RTM application. Experiments performed with different computational systems and input sizes show that the proposed method is consistently better than the default OpenMP schedulers (static, auto, and guided), making the application up to 33% faster. We show that the likely reasons for this performance are the reduction of cache misses, mainly at the L3 level, and a low overhead of less than 2%. Having proven robust and scalable for 3D RTM, the proposed method could also improve the performance of similar wave-based algorithms, such as full-waveform inversion (FWI), and of other iterative applications.
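The chunk-size trade-off described in this abstract can be illustrated with a small toy model. The sketch below is not the paper's CSA auto-tuner: it replaces it with a plain sweep over candidate chunk sizes, and the names `dynamic_chunks`, `simulated_makespan`, `best_chunk_size` and the `dispatch_overhead` parameter are hypothetical. It only assumes the usual behavior of a dynamic schedule, where consecutive chunks of iterations are handed to whichever worker becomes free first.

```python
import heapq


def dynamic_chunks(n_iterations, chunk_size):
    """Split [0, n_iterations) into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [(start, min(start + chunk_size, n_iterations))
            for start in range(0, n_iterations, chunk_size)]


def simulated_makespan(costs, chunk_size, n_workers, dispatch_overhead):
    """Toy model of dynamic scheduling (an assumption, not a measurement).

    Each chunk goes to the worker that becomes free first and costs the sum
    of its iteration costs plus a fixed per-chunk dispatch overhead.
    """
    finish_times = [0.0] * n_workers  # min-heap of worker finish times
    heapq.heapify(finish_times)
    for start, end in dynamic_chunks(len(costs), chunk_size):
        free_at = heapq.heappop(finish_times)
        busy_for = dispatch_overhead + sum(costs[start:end])
        heapq.heappush(finish_times, free_at + busy_for)
    return max(finish_times)


def best_chunk_size(costs, n_workers, dispatch_overhead, candidates):
    """Plain sweep over candidates; stands in for the paper's CSA search."""
    return min(candidates,
               key=lambda c: simulated_makespan(costs, c, n_workers,
                                                dispatch_overhead))


if __name__ == "__main__":
    iteration_costs = [1.0] * 8  # hypothetical uniform per-iteration cost
    for chunk in (1, 2, 4, 8):
        print(chunk, simulated_makespan(iteration_costs, chunk, 2, 0.5))
```

In this toy model, very small chunks pay the dispatch overhead many times, while very large chunks leave workers idle, which is the granularity trade-off the abstract refers to.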

Preprint, 2020, arXiv

When parallel speedups hit the memory wall

After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none have considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but they assume data-access delays to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of possible configurations of operating frequency and number of cores that current architectures can offer, suitable speedup models to describe such variations among these configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed modeling can capture the desired behavior, while experimental hardware results validate the former. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, besides being a black-box approach, conventional machine-learning modeling needs, according to our experiments, about one order of magnitude more measurements to reach the same level of accuracy achieved by our modeling.
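For reference, the classical Amdahl baseline that this abstract starts from can be sketched as follows. This is only the textbook law, which treats data-access delay as constant; it is not one of the models proposed in the paper. The function name `amdahl_speedup` and the parameter `parallel_fraction` are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Textbook Amdahl speedup: 1 / ((1 - f) + f / n).

    f is the fraction of the work that can run in parallel and n is the
    number of cores. Memory-wall effects are NOT modeled here.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload in which 90% of the work is parallelizable.
    for n in (1, 2, 4, 8, 16):
        print(n, round(amdahl_speedup(0.9, n), 3))
```

Under this baseline the speedup for a fixed `parallel_fraction` depends only on the core count, which is the simplification the abstract argues against when data-access delays vary with the configuration.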
