Researcher profile

Cristóbal A. Navarro

Cristóbal A. Navarro contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2022arXiv

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the new block-space map $\mathcal{H}: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ for regular orthogonal simplex domains, which is analyzed in terms of resource usage, and ii) the experimental evaluation in terms of speedup over a bounding box approach and energy efficiency as elements per second per Watt. Results from the analysis show that $\mathcal{H}$ has a potential speedup of up to $2\times$ and $6\times$ for $2$ and $3$-simplices, respectively. Experimental evaluation shows that $\mathcal{H}$ is competitive for $2$-simplices, reaching $1.2\times \sim 2.0\times$ of speedup for different tests, which is on par with the fastest state of the art approaches. For $3$-simplices $\mathcal{H}$ reaches up to $1.3\times \sim 6.0\times$ of speedup making it the fastest of all. The extension of $\mathcal{H}$ to higher dimensional $m$-simplices is feasible and has a potential speedup that scales as $m!$ given a proper selection of parameters $r, β$ which are the scaling and replication factors, respectively. In terms of energy consumption, although $\mathcal{H}$ is among the highest in power consumption, it compensates by its short duration, making it one of the most energy efficient approaches. Lastly, further improvements with Tensor and Ray Tracing Cores are analyzed, giving insights to leverage each one of them. The results obtained in this work show that $\mathcal{H}$ is a scalable and energy efficient map that can contribute to the efficiency of GPU applications when they need to process $m$-simplex domains, such as Cellular Automata or PDE simulations.

preprint2022arXiv

GGArray: A Dynamically Growable GPU Array

We present a dynamically Growable GPU array (GGArray) fully implemented in GPU that does not require synchronization with the host. The idea is to improve the programming of GPU applications that require dynamic memory, by offering a structure that does not require pre-allocating GPU VRAM for the worst case scenario. The GGArray is based on the LFVector, by utilizing an array of them in order to take advantage of the GPU architecture and the synchronization offered by thread blocks. This structure is compared to other state of the art ones such as a pre-allocated static array and a semi-static array that needs to be resized through communication with the host. Experimental evaluation shows that the GGArray has a competitive insertion and resize performance, but it is slower for regular parallel memory accesses. Given the results, the GGArray is a potentially useful structure for applications with high uncertainty on the memory usage as well as applications that have phases, such as an insertion phase followed by a regular GPU phase. In such cases, the GGArray can be used for the first phase and then data can be flattened for the second phase in order to allow the classical GPU memory accesses which are faster. These results constitute a step towards achieving a parallel efficient C++ like vector for modern GPU architectures.

preprint2022arXiv

GPU Voronoi Diagrams for Random Moving Seeds

The Voronoi Diagram is a geometrical structure that is widely used in scientific or technological applications where proximity is a relevant aspect to consider, and it also resembles natural phenomena such as cellular banks, rock formations or bee hives, among others. Typically, computing the Voronoi Diagram is done in a static context, that is, the location of the input seeds is defined once and does not change. In this work we study the dynamic case where seeds move, which leads to a dynamic Voronoi Diagram that changes over time. In particular, we consider uniform random moving seeds, for which we propose the \textit{dynamic Jump Flooding Algorithm} (dJFA), a variant of JFA that uses less iterations than the standard JFA. An experimental evaluation shows that dJFA achieves a speedup of up to $\sim 5.3 \times$ over JFA, while maintaining a similarity of at least $88\%$ and close to $100\%$ in many cases. These results contribute with a step towards the achievement of real-time GPU-based computation of dynamic Voronoi diagrams for any particle simulation.

preprint2022arXiv

Modeling GPU Dynamic Parallelism for Self Similar Density Workloads

Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to execute additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also speedup problems that exhibit a heterogeneous data layout by focusing, through a subdivision process, the finite GPU resources on the sub-regions that exhibit more parallelism. However, doing an optimal subdivision process is not trivial, as there are different parameters that play an important role in the final performance of DP. Moreover, the current programming abstraction for DP also introduces an overhead that can penalize the final performance. In this work we present a subdivision cost model for problems that exhibit self similar density (SSD) workloads (such as fractals), in order understand what parameters provide the fastest subdivision approach. Also, we introduce a new subdivision implementation, named \textit{Adaptive Serial Kernels} (ASK), as a smaller overhead alternative to CUDA's Dynamic Parallelism. Using the cost model on the Mandelbrot Set as a case study shows that the optimal scheme is to start with an initial subdivision between $g=[2,16]$, then keep subdividing in regions of $r=2,4$, and stop when regions reach a size of $B \sim 32$. The experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the proposed ASK approach runs up to $\sim 60\%$ faster than Dynamic Parallelism in the Mandelbrot set, and up to $12\times$ faster than a basic exhaustive implementation, whereas DP is up to $7.5\times$.

preprint2022arXiv

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

This work presents Squeeze, an efficient compact fractal processing scheme for tensor core GPUs. By combining discrete-space transformations between compact and expanded forms, one can do data-parallel computation on a fractal with neighborhood access without needing to expand the fractal in memory. The space transformations are formulated as two GPU tensor-core accelerated thread maps, $λ(ω)$ and $ν(ω)$, which act as compact-to-expanded and expanded-to-compact space functions, respectively. The cost of the maps is $\mathcal{O}(\log_2 \log_s(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form, and $s$ the linear scaling factor. The proposed approach works for any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Experimental results using a discrete Sierpinski Triangle as a case study shows up to $\sim12\times$ of speedup and a memory reduction factor of up to $\sim 315\times$ with respect to a GPU-based expanded-space bounding box approach. These results show that the proposed compact approach will allow the scientific community to efficiently tackle problems that up to now could not fit into GPU memory.

preprint2020arXiv

Efficient GPU Thread Mapping on Embedded 2D Fractals

This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H}$ being the Hausdorff dimension of the fractal, making it parallel space efficient. When compared to a bounding-box (BB) approach, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup $n \ge n_0$. The Sierpinski gasket fractal is used as a particular case study and the experimental performance results show that $λ(ω)$ reaches up to $9\times$ of speedup over the bounding-box approach. A tensor-core based implementation of $λ(ω)$ is also proposed for modern GPUs, providing up to $\sim40\%$ of extra performance. The results obtained in this work show that doing efficient GPU thread mapping on fractal domains can significantly improve the performance of several applications that work with this type of geometry.

preprint2020arXiv

GPU Tensor Cores for fast Arithmetic Reductions

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n)=5 log_{m^2}{n}$ and its speedup is $S=\dfrac{4}{5} log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2 \times$ faster than a conventional GPU reduction implementation, and preserves the numerical precision because the sub-results of each chain of $R$ MMAs is kept as a 32-bit floating point value, before being all reduced into as a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R=4,5$ MMAs per block, while large thread-blocks work best with $R=1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.