Researcher profile

Samuel Xavier-de-Souza

Samuel Xavier-de-Souza contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
1topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2020arXiv

An OpenMP translator for the GAP8 MPSoC

One of the barriers to the adoption of parallel computing is the inherent complexity of its programming. The Open Multi-Processing (OpenMP) Application Programming Interface (API) facilitates such implementations, providing high abstraction level directives. On another front, new architectures aimed at low energy consumption have been developed, such as the Greenwaves Technologies GAP8, a Multi-Processor System-on-Chip (MPSoC) based on the Parallel Ultra Low Power (PULP) Platform. The GAP8 has an 8-core cluster and a Fabric Controller(FC) master core. Parallel programming with GAP8 is very promising on the efficiency side, but its recent development and lack of a robust OS to handle threads and core scheduling complicate a simple implementation of the OpenMP APIs. This project implements a source to source translator that interprets a limited set of OpenMP directives, and is capable of generating parallel microcontroller code manipulating the cores directly. The preliminary results obtained in this work shows a reduction of the code size, if compared with the base implementation, proving the efficiency of the project to ease the programming of the GAP8. Further work is need in order to implement more OpenMP directives.

preprint2020arXiv

Auto-tuning of dynamic scheduling applied to 3D reverse time migration on multicore systems

Reverse time migration (RTM) is an algorithm widely used in the oil and gas industry to process seismic data. It is a computationally intensive task that suits well in parallel computers. Methods such as RTM can be parallelized in shared memory systems through scheduling iterations of parallel loops to threads. However, several aspects, such as memory size and hierarchy, number of cores, and input size, make optimal scheduling very challenging. In this paper, we introduce a run-time strategy to automatically tune the dynamic scheduling of parallel loops iterations in iterative applications, such as the RTM, in multicore systems. The proposed method aims to reduce the execution time of such applications. To find the optimal granularity, we propose a coupled simulated annealing (CSA) based auto-tuning strategy that adjusts the chunk size of work that OpenMP parallel loops assign dynamically to worker threads during the initialization of a 3D RTM application. Experiments performed with different computational systems and input sizes show that the proposed method is consistently better than the default OpenMP schedulers, static, auto, and guided, causing the application to be up to 33% faster. We show that the possible reason for this performance is the reduction of cache misses, mainly level L3, and low overhead, inferior to 2%. Having shown to be robust and scalable for the 3D RTM, the proposed method could also improve the performance of similar wave-based algorithms, such as full-waveform inversion (FWI) and other iterative applications.

preprint2020arXiv

When parallel speedups hit the memory wall

After Amdahl's trailblazing work, many other authors proposed analytical speedup models but none have considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but data-access delays are assumed to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of possible configurations of operating frequency and number of cores that current architectures can offer, suitable speedup models to describe such variations among these configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed modeling can capture the desired behavior while experimental hardware results validate the former. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, besides being black-box modeling, our experiments show that conventional machine-learning modeling needs about one order of magnitude more measurements to reach the same level of accuracy achieved in our modeling.