Source author record

Nancy Hitschfeld

Nancy Hitschfeld appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Computational Geometry Discrete Mathematics Machine Learning Hardware Architecture

Catalog footprint

What is connected

9works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

This work proposes a new GPU thread map for $m$-simplex domains, that scales its speedup with dimension and is energy efficient compared to other state of the art approaches. The main contributions of this work are i) the formulation of the new block-space map $\mathcal{H}: \mathbb{Z}^m \mapsto \mathbb{Z}^m$ for regular orthogonal simplex domains, which is analyzed in terms of resource usage, and ii) the experimental evaluation in terms of speedup over a bounding box approach and energy efficiency as elements per second per Watt. Results from the analysis show that $\mathcal{H}$ has a potential speedup of up to $2\times$ and $6\times$ for $2$ and $3$-simplices, respectively. Experimental evaluation shows that $\mathcal{H}$ is competitive for $2$-simplices, reaching $1.2\times \sim 2.0\times$ of speedup for different tests, which is on par with the fastest state of the art approaches. For $3$-simplices $\mathcal{H}$ reaches up to $1.3\times \sim 6.0\times$ of speedup making it the fastest of all. The extension of $\mathcal{H}$ to higher dimensional $m$-simplices is feasible and has a potential speedup that scales as $m!$ given a proper selection of parameters $r, β$ which are the scaling and replication factors, respectively. In terms of energy consumption, although $\mathcal{H}$ is among the highest in power consumption, it compensates by its short duration, making it one of the most energy efficient approaches. Lastly, further improvements with Tensor and Ray Tracing Cores are analyzed, giving insights to leverage each one of them. The results obtained in this work show that $\mathcal{H}$ is a scalable and energy efficient map that can contribute to the efficiency of GPU applications when they need to process $m$-simplex domains, such as Cellular Automata or PDE simulations.

preprint2022arXiv

A Topological Data Analysis Based Classifier

Topological Data Analysis (TDA) is an emergent field that aims to discover topological information hidden in a dataset. TDA tools have been commonly used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, without any further ML stage, showing advantages for imbalanced datasets. The proposed algorithm builds a filtered simplicial complex on the dataset. Persistent Homology (PH) is applied to guide the selection of a sub-complex where unlabeled points obtain the label with the majority of votes from labeled neighboring points. We select 8 datasets with different dimensions, degrees of class overlap and imbalanced samples per class. On average, the proposed TDABC method was better than KNN and weighted-KNN. It behaves competitively with Local SVM and Random Forest baseline classifiers in balanced datasets, and it outperforms all baseline methods classifying entangled and minority classes.

preprint2022arXiv

GPU Parallel algorithm for the generation of polygonal meshes based on terminal-edge regions

This paper presents a GPU parallel algorithm to generate a new kind of polygonal meshes obtained from Delaunay triangulations. To generate the polygonal mesh, the algorithm first uses a classification system to label each edge of an input triangulation; second it builds polygons (simple or not) from terminal-edge regions using the label system, and third it transforms each non-simple polygon from the previous phase into simple ones, convex or not convex polygons. We show some preliminary experiments to test the scalability of the algorithm and compare it with the sequential version. We also run a very simple test to show that these meshes can be useful for the virtual element method.

preprint2022arXiv

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

This work presents Squeeze, an efficient compact fractal processing scheme for tensor core GPUs. By combining discrete-space transformations between compact and expanded forms, one can do data-parallel computation on a fractal with neighborhood access without needing to expand the fractal in memory. The space transformations are formulated as two GPU tensor-core accelerated thread maps, $λ(ω)$ and $ν(ω)$, which act as compact-to-expanded and expanded-to-compact space functions, respectively. The cost of the maps is $\mathcal{O}(\log_2 \log_s(n))$ time, with $n$ being the side of a $n \times n$ embedding for the fractal in its expanded form, and $s$ the linear scaling factor. The proposed approach works for any fractal that belongs to the Non-overlapping-Bounding-Boxes (NBB) class of discrete fractals, and can be extended to three dimensions as well. Experimental results using a discrete Sierpinski Triangle as a case study shows up to $\sim12\times$ of speedup and a memory reduction factor of up to $\sim 315\times$ with respect to a GPU-based expanded-space bounding box approach. These results show that the proposed compact approach will allow the scientific community to efficiently tackle problems that up to now could not fit into GPU memory.

preprint2021arXiv

Classification based on Topological Data Analysis

Topological Data Analysis (TDA) is an emergent field that aims to discover topological information hidden in a dataset. TDA tools have been commonly used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes an algorithm that applies TDA directly to multi-class classification problems, even imbalanced datasets, without any further ML stage. The proposed algorithm built a filtered simplicial complex on the dataset. Persistent homology is then applied to guide choosing a sub-complex where unlabeled points obtain the label with most votes from labeled neighboring points. To assess the proposed method, 8 datasets were selected with several degrees of class entanglement, variability on the samples per class, and dimensionality. On average, the proposed TDABC method was capable of overcoming baseline classifiers (wk-NN and k-NN) in each of the computed metrics, especially on classifying entangled and minority classes.

preprint2020arXiv

Efficient GPU Thread Mapping on Embedded 2D Fractals

This work proposes a new approach for mapping GPU threads onto a family of discrete embedded 2D fractals. A block-space map $λ: \mathbb{Z}_{\mathbb{E}}^{2} \mapsto \mathbb{Z}_{\mathbb{F}}^{2}$ is proposed, from Euclidean parallel space $\mathbb{E}$ to embedded fractal space $\mathbb{F}$, that maps in $\mathcal{O}(\log_2 \log_2(n))$ time and uses no more than $\mathcal{O}(n^\mathbb{H})$ threads with $\mathbb{H}$ being the Hausdorff dimension of the fractal, making it parallel space efficient. When compared to a bounding-box (BB) approach, $λ(ω)$ offers a sub-exponential improvement in parallel space and a monotonically increasing speedup $n \ge n_0$. The Sierpinski gasket fractal is used as a particular case study and the experimental performance results show that $λ(ω)$ reaches up to $9\times$ of speedup over the bounding-box approach. A tensor-core based implementation of $λ(ω)$ is also proposed for modern GPUs, providing up to $\sim40\%$ of extra performance. The results obtained in this work show that doing efficient GPU thread mapping on fractal domains can significantly improve the performance of several applications that work with this type of geometry.

preprint2016arXiv

A Non-linear GPU Thread Map for Triangular Domains

There is a stage in the GPU computing pipeline where a grid of thread-blocks, in \textit{parallel space}, is mapped onto the problem domain, in \textit{data space}. Since the parallel space is restricted to a box type geometry, the mapping approach is typically a $k$-dimensional bounding box (BB) that covers a $p$-dimensional data space. Threads that fall inside the domain perform computations while threads that fall outside are discarded at runtime. In this work we study the case of mapping threads efficiently onto triangular domain problems and propose a block-space linear map $λ(ω)$, based on the properties of the lower triangular matrix, that reduces the number of unnnecessary threads from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$. Performance results for global memory accesses show an improvement of up to $18\%$ with respect to the \textit{bounding-box} approach, placing $λ(ω)$ on second place below the \textit{rectangular-box} approach and above the \textit{recursive-partition} and \textit{upper-triangular} approaches. For shared memory scenarios $λ(ω)$ was the fastest approach achieving $7\%$ of performance improvement while preserving thread locality. The results obtained in this work make $λ(ω)$ an interesting map for efficient GPU computing on parallel problems that define a triangular domain with or without neighborhood interactions. The extension to tetrahedral domains is analyzed, with applications to triplet-interaction n-body applications.

preprint2016arXiv

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

The study of data-parallel domain re-organization and thread-mapping techniques are relevant topics as they can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succint data re-organization of a tetrahedral data-parallel domain of size $\mathcal{O}(n^3)$ combined with an efficient block-space GPU map of the form $g:\mathbb{N} \rightarrow \mathbb{N}^3$. Results from the analysis suggest that in theory the combination of these two optimizations produce significant performance improvement as block-based data re-organization allows a coalesced one-to-one correspondence at local thread-space while $g(λ)$ produces an efficient block-space spatial correspondence between groups of data and groups of threads, reducing the number of unnecessary threads from $O(n^3)$ to $O(n^2ρ^3)$ where $ρ$ is the linear block-size and typically $ρ^3 \ll n$. From the analysis, we obtained that a block based succint data re-organization can provide up to $2\times$ improved performance over a linear data organization while the map can be up to $6\times$ more efficient than a bounding box approach. The results from this work can serve as a useful guide for a more efficient GPU computation on tetrahedral domains found in spin lattice, finite element and special n-body problems, among others.

preprint2013arXiv

Improving the GPU space of computation under triangular domain problems

There is a stage in the GPU computing pipeline where a grid of thread-blocks is mapped to the problem domain. Normally, this grid is a k-dimensional bounding box that covers a k-dimensional problem no matter its shape. Threads that fall inside the problem domain perform computations, otherwise they are discarded at runtime. For problems with non-square geometry, this is not always the best idea because part of the space of computation is executed without any practical use. Two- dimensional triangular domain problems, alias td-problems, are a particular case of interest. Problems such as the Euclidean distance map, LU decomposition, collision detection and simula- tions over triangular tiled domains are all td-problems and they appear frequently in many areas of science. In this work, we propose an improved GPU mapping function g(lambda), that maps any lambda block to a unique location (i, j) in the triangular domain. The mapping is based on the properties of the lower triangular matrix and it works at a block level, thus not compromising thread organization within a block. The theoretical improvement from using g(lambda) is upper bounded as I < 2 and the number of wasted blocks is reduced from O(n^2) to O(n). We compare our strategy with other proposed methods; the upper-triangular mapping (UTM), the rectangular box (RB) and the recursive partition (REC). Our experimental results on Nvidias Kepler GPU architecture show that g(lambda) is between 12% and 15% faster than the bounding box (BB) strategy. When compared to the other strategies, our mapping runs significantly faster than UTM and it is as fast as RB in practical use, with the advantage that thread organization is not compromised, as in RB. This work also contributes at presenting, for the first time, a fair comparison of all existing strategies running the same experiments under the same hardware.

Nancy Hitschfeld

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

A Scalable and Energy Efficient GPU Thread Map for m-Simplex Domains

A Topological Data Analysis Based Classifier

GPU Parallel algorithm for the generation of polygonal meshes based on terminal-edge regions

Squeeze: Efficient Compact Fractals for Tensor Core GPUs

Classification based on Topological Data Analysis

Efficient GPU Thread Mapping on Embedded 2D Fractals

A Non-linear GPU Thread Map for Triangular Domains

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Improving the GPU space of computation under triangular domain problems