Source author record

Jianbin Fang

Jianbin Fang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Distributed, Parallel, and Cluster Computing · Performance · Programming Languages · Machine Learning · physics.comp-ph

Catalog footprint

What is connected

7 works

5 topics

4 close collaborators

Preprint, 2020, arXiv

Efficient and High-quality Sparse Graph Coloring on the GPU

Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work-efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach employs the speculative greedy scheme, which inherently yields better quality than methods based on finding maximal independent sets. To achieve high performance on GPUs, we refine the algorithm to leverage efficient operators and alleviate conflicts. We also incorporate common optimization techniques to further improve performance. Our method is evaluated with both synthetic and real-world sparse graphs on an NVIDIA GPU. Experimental results show that our proposed implementation achieves an average 4.1x (up to 8.9x) speedup over the serial implementation. It also outperforms the existing GPU implementation from the NVIDIA cuSPARSE library (2.2x average speedup), while yielding much better coloring quality than cuSPARSE.
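The speculative greedy scheme named in this abstract can be illustrated with a small sequential simulation. This is a minimal sketch, not the paper's GPU implementation: the function name, the adjacency-dictionary input and the "lower vertex id wins" tie-break are assumptions made for illustration. Each round lets every uncolored vertex pick the smallest color absent from its neighbors (reading a snapshot, as concurrent threads would), then uncolors the losers of any conflict and retries them.

```python
def speculative_greedy_coloring(adj):
    """Colour an undirected graph given as {vertex: [neighbours]}.

    Sequential simulation of the round structure only; a GPU version would
    run each phase as a parallel kernel.
    """
    color = {v: -1 for v in adj}  # -1 means "not coloured yet"
    worklist = sorted(adj)
    while worklist:
        # Phase 1 (speculate): every active vertex picks the smallest colour
        # not present on its neighbours in the snapshot taken before the round.
        snapshot = dict(color)
        for v in worklist:
            forbidden = {snapshot[u] for u in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Phase 2 (detect conflicts): when two neighbours chose the same
        # colour, the vertex with the higher id loses (assumed tie-break).
        losers = [v for v in worklist
                  if any(color[u] == color[v] and u < v for u in adj[v])]
        for v in losers:
            color[v] = -1
        worklist = losers  # only the losers are retried in the next round
    return color


# Toy graph: a triangle (0, 1, 2) with a pendant vertex 3 attached to 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
colors = speculative_greedy_coloring(graph)
```

The lowest-numbered active vertex can never lose a conflict, so the worklist shrinks every round and the loop terminates. Because each vertex takes the smallest free color, no vertex receives a color index larger than its degree.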

Preprint, 2020, arXiv

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
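The offline-training and runtime-search workflow described in this abstract can be sketched as follows. Everything concrete here is a hypothetical stand-in: `toy_speedup` replaces real profiling runs and produces synthetic values (not measurements), a single `transfer_ratio` feature replaces the paper's program features, and a nearest-neighbour average replaces whatever learner the authors actually used.

```python
from itertools import product
from math import log2

# Candidate (resource partitions, tasks) pairs searched at runtime (assumed).
CONFIGS = list(product([1, 2, 4, 8], [1, 2, 4, 8, 16]))


def toy_speedup(transfer_ratio, partitions, tasks):
    """Synthetic placeholder for profiling one training program."""
    overlap = transfer_ratio * (1 - 1 / min(partitions, tasks))
    overhead = 0.01 * tasks + 0.02 * partitions
    return 1.0 / (1.0 - overlap + overhead)


# Offline phase: "profile" every configuration of a few training programs.
SAMPLES = [((r, p, t), toy_speedup(r, p, t))
           for r in (0.1, 0.3, 0.5, 0.7) for p, t in CONFIGS]


def predict(transfer_ratio, partitions, tasks, k=2):
    """Estimate speedup as the mean of the k nearest training samples."""
    def distance(sample):
        (r, p, t), _ = sample
        return ((r - transfer_ratio) ** 2
                + (log2(p) - log2(partitions)) ** 2
                + (log2(t) - log2(tasks)) ** 2)
    nearest = sorted(SAMPLES, key=distance)[:k]
    return sum(speedup for _, speedup in nearest) / k


def best_config(transfer_ratio):
    """Runtime phase: rank every candidate with the model, run nothing."""
    return max(CONFIGS, key=lambda cfg: predict(transfer_ratio, *cfg))


# An "unseen" program whose feature value is absent from the training set.
chosen = best_config(0.4)
```

The point of the design is that the runtime search only evaluates the cheap model, so the cost of trying each configuration on real hardware is paid once, offline, on the training programs.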

Preprint, 2020, arXiv

Parallel Programming Models for Heterogeneous Many-Cores: A Survey

Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review the compiler techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map for a wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.

Preprint, 2016, arXiv

Evaluating the Performance Impact of Multiple Streams on the MIC-based Heterogeneous Platform

Using multiple streams can improve the overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work focuses largely on GPUs, but little is known about the performance impact on the Intel Xeon Phi. In this work, we apply multiple streams to six real-world applications on Phi. We then systematically evaluate the performance benefits of using multiple streams. The evaluation is performed at two levels: the microbenchmarking level and the real-world application level. Our experimental results at the microbenchmark level show that data transfers and kernel execution can be overlapped on Phi, while data transfers in both directions are performed in a serial manner. At the real-world application level, we show that both overlappable and non-overlappable applications can benefit from using multiple streams (with a performance improvement of up to 24%). We also quantify how task granularity and resource granularity impact the overall performance. Finally, we present a set of heuristics to reduce the search space when determining a proper task granularity and resource granularity. To conclude, our evaluation provides many insights for runtime and architecture designers when using multiple streams on Phi.
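The overlap behaviour reported in this abstract (kernels overlap with transfers, while transfers in the two directions share one serial link) can be explored with a toy timing model. This is a sketch under stated assumptions: the function name, the fixed issue order on the link, the optional `launch_overhead` parameter and the example durations are all hypothetical and use arbitrary time units, not measured values.

```python
def makespan(total_h2d, total_kernel, total_d2h, n_streams, launch_overhead=0.0):
    """Finish time when the work is split evenly into n_streams tasks.

    Model: one link carries host-to-device and device-to-host copies one at
    a time; one compute engine runs kernels and may overlap with the link.
    """
    h2d = total_h2d / n_streams
    kern = total_kernel / n_streams + launch_overhead
    d2h = total_d2h / n_streams

    link_free = 0.0    # time at which the shared link becomes idle
    core_free = 0.0    # time at which the compute engine becomes idle
    kernel_done = []   # completion time of each task's kernel

    for i in range(n_streams):
        # Copy in the input of task i (no dependency, only needs the link).
        link_free += h2d
        # Run the kernel of task i once its input has arrived.
        core_free = max(core_free, link_free) + kern
        kernel_done.append(core_free)
        # Copy out the result of the previous task on the same link.
        if i > 0:
            link_free = max(link_free, kernel_done[i - 1]) + d2h

    # Copy out the result of the last task.
    link_free = max(link_free, kernel_done[-1]) + d2h
    return link_free


single = makespan(4, 8, 4, n_streams=1)    # no overlap: 4 + 8 + 4
streamed = makespan(4, 8, 4, n_streams=4)  # transfers hidden behind kernels
```

With these example durations the single-stream run takes 16 time units and the four-stream run takes 10. Raising `launch_overhead` shows how a finer task granularity eventually stops paying off in this model.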

Preprint, 2016, arXiv

Streaming Applications on Heterogeneous Platforms

Using multiple streams can improve the overall system performance by mitigating the data transfer overhead on heterogeneous systems. Currently, very few cases have been streamed to demonstrate the performance impact of streaming, and a systematic investigation of when streaming is necessary and how to apply it over a large number of test cases remains a gap. In this paper, we use a total of 56 benchmarks to build a statistical view of the data transfer overhead, and give an in-depth analysis of the impacting factors. Among the heterogeneous codes, we identify two types of non-streamable codes and three types of streamable codes, for which a streaming approach has been proposed. Our experimental results on the CPU-MIC platform show that, with multiple streams, we can improve the application performance by up to 90%. Our work can serve as a generic flow for using multiple streams on heterogeneous platforms.

Preprint, 2015, arXiv

NEMO5: Achieving High-end Internode Communication for Performance Projection Beyond Moore's Law

Electronic performance predictions of modern nanotransistors require nonequilibrium Green's functions including incoherent scattering on phonons as well as inclusion of random alloy disorder and surface roughness effects. Solving for all these effects is numerically extremely expensive and has to be done on the world's largest supercomputers due to the large memory requirement and the high performance demands on the communication network between the compute nodes. In this work, it is shown that NEMO5 covers all required physical effects and their combination. Furthermore, it is also shown that NEMO5's implementation of the algorithm scales very well up to about 178,176 CPUs with a sustained performance of about 857 TFLOPS. Therefore, NEMO5 is ready to simulate future nanotransistors.

Preprint, 2013, arXiv

An Empirical Study of Intel Xeon Phi

With at least 50 cores, Intel Xeon Phi is a true many-core architecture. Featuring fairly powerful cores, two cache levels, and very fast interconnections, the Xeon Phi can reach a theoretical peak of 1000 GFLOPs and over 240 GB/s. These numbers, as well as its flexibility (it can be used either as a coprocessor or as a stand-alone processor), are very tempting for parallel applications looking for new performance records. In this paper, we present an empirical study of Xeon Phi, stressing its performance limits and relevant performance factors, ultimately aiming to present a simplified view of the machine for regular programmers in search of performance. To do so, we have micro-benchmarked the main hardware components of the processor: the cores, the memory hierarchies, the ring interconnect, and the PCIe connection. We show that, in ideal microbenchmarking conditions, the performance that can be achieved is very close to the theoretical peak, as given in the official programmer's guide. We have also identified and quantified several causes of significant performance penalties. Our findings have been captured in four optimization guidelines, and used to build a simplified programmer's view of Xeon Phi, eventually enabling the design and prototyping of applications on a functionality-based model of the architecture.
