Source author record

Roger Pearce

Roger Pearce appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Computer Vision Machine Learning Performance

Catalog footprint

What is connected

5works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2025arXiv

Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Discrete GPUs are a cornerstone of HPC and data center systems, requiring management of separate CPU and GPU memory spaces. Unified Virtual Memory (UVM) has been proposed to ease the burden of memory management; however, at a high cost in performance. The recent introduction of AMD's MI300A Accelerated Processing Units (APUs)--as deployed in the El Capitan supercomputer--enables HPC systems featuring integrated CPU and GPU with Unified Physical Memory (UPM) for the first time. This work presents the first comprehensive characterization of the UPM architecture on MI300A. We first analyze the UPM system properties, including memory latency, bandwidth, and coherence overhead. We then assess the efficiency of the system software in memory allocation, page fault handling, TLB management, and Infinity Cache utilization. We propose a set of porting strategies for transforming applications for the UPM architecture and evaluate six applications on the MI300A APU. Our results show that applications on UPM using the unified memory model can match or outperform those in the explicitly managed model--while reducing memory costs by up to 44%.

preprint2022arXiv

Metall: A Persistent Memory Allocator For Data-Centric Analytics

Data analytics applications transform raw input data into analytics-specific data structures before performing analytics. Unfortunately, such data ingestion step is often more expensive than analytics. In addition, various types of NVRAM devices are already used in many HPC systems today. Such devices will be useful for storing and reusing data structures beyond a single process life cycle. We developed Metall, a persistent memory allocator built on top of the memory-mapped file mechanism. Metall enables applications to transparently allocate custom C++ data structures into various types of persistent memories. Metall incorporates a concise and high-performance memory management algorithm inspired by Supermalloc and the rich C++ interface developed by Boost.Interprocess library. On a dynamic graph construction workload, Metall achieved up to 11.7x and 48.3x performance improvements over Boost.Interprocess and memkind (PMEM kind), respectively. We also demonstrate Metall's high adaptability by integrating Metall into a graph processing framework, GraphBLAS Template Library. This study's outcomes indicate that Metall will be a strong tool for accelerating future large-scale data analytics by allowing applications to leverage persistent memory efficiently.

preprint2022arXiv

Towards Distributed 2-Approximation Steiner Minimal Trees in Billion-edge Graphs

Given an edge-weighted graph and a set of known seed vertices, a network scientist often desires to understand the graph relationships to explain connections between the seed vertices. When the seed set is 3 or larger Steiner minimal tree - min-weight acyclic connected subgraph (of the input graph) that contains all the seed vertices - is an attractive generalization of shortest weighted paths. In general, computing a Steiner minimal tree is NP-hard, but several polynomial-time algorithms have been designed and proven to yield Steiner trees whose total weight is bounded within 2 times the Steiner minimal tree. In this paper, we present a parallel 2-approximation Steiner minimal tree algorithm and its MPI-based distributed implementation. In place of distance computation between all pairs of seed vertices, an expensive phase in many algorithms, our solution exploits Voronoi cell computation. Also, this approach has higher parallel efficiency than others that involve minimum spanning tree computation on the entire graph. Furthermore, our distributed design exploits asynchronous processing and a message prioritization scheme to accelerate convergence of distance computation, and harnesses both vertex and edge centric processing to offer fast time-to-solution. We demonstrate scalability and performance of our solution using real-world graphs with up to 128 billion edges and 512 compute nodes (8K processes). We compare our solution with the state-of-the-art exact Steiner minimal tree solver, SCIP-Jack, and two serial algorithms. Our solution comfortably outperforms these related works on graphs with 10s million edges and offers decent strong scaling - up to 90% efficient. We empirically show that, on average, the total distance of the Steiner tree identified by our solution is 1.0527 times greater than the Steiner minimal tree - well within the theoretical bound of less than equal to 2.

preprint2020arXiv

Scalable Pattern Matching in Metadata Graphs via Constraint Checking

Pattern matching is a fundamental tool for answering complex graph queries. Unfortunately, existing solutions have limited capabilities: they do not scale to process large graphs and/or support only a restricted set of search templates or usage scenarios. We present an algorithmic pipeline that bases pattern matching on constraint checking. The key intuition is that each vertex or edge participating in a match has to meet a set of constrains implicitly specified by the search template. The pipeline we propose, generates these constraints and iterates over them to eliminate all the vertices and edges that do not participate in any match, and reduces the background graph to a subgraph which is the union of all matches. Additional analysis can be performed on this annotated, reduced graph, such as full match enumeration. Furthermore, a vertex-centric formulation for constraint checking algorithms exists, and this makes it possible to harness existing high-performance, vertex-centric graph processing frameworks. The key contribution of this work is a design following the constraint checking approach for exact matching and its experimental evaluation. We show that the proposed technique: (i) enables highly scalable pattern matching in labeled graphs, (ii) supports arbitrary patterns with 100% precision, (iii) always selects all vertices and edges that participate in matches, thus offering 100% recall, and (iv) supports a set of popular data analysis scenarios. We implement our approach on top of HavoqGT, an open-source asynchronous graph processing framework, and demonstrate its advantages through strong and weak scaling experiments on massive scale real-world (up to 257 billion edges) and synthetic (up to 4.4 trillion edges) labeled graphs respectively, and at scales (1,024 nodes / 36,864 cores), orders of magnitude larger than used in the past for similar problems.

preprint2015arXiv

Large-Scale Deep Learning on the YFCC100M Dataset

We present a work-in-progress snapshot of learning with a 15 billion parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to-date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements in the learning of concepts at the highest layers. We train our three-layer deep neural network on the Yahoo! Flickr Creative Commons 100M dataset. The dataset comprises approximately 99.2 million images and 800,000 user-created videos from Yahoo's Flickr image and video sharing platform. Training of our network takes eight days on 98 GPU nodes at the High Performance Computing Center at Lawrence Livermore National Laboratory. Encouraging preliminary results and future research directions are presented and discussed.