Researcher profile

Yili Zheng

Yili Zheng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 13 - UnverifiedVerification L1Unclaimed author
2works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

2 published item(s)

preprint2016arXiv

An Asynchronous Task-based Fan-Both Sparse Cholesky Solver

Systems of linear equations arise at the heart of many scientific and engineering applications. Many of these linear systems are sparse; i.e., most of the elements in the coefficient matrix are zero. Direct methods based on matrix factorizations are sometimes needed to ensure accurate solutions. For example, accurate solution of sparse linear systems is needed in shift-invert Lanczos to compute interior eigenvalues. The performance and resource usage of sparse matrix factorizations are critical to time-to-solution and maximum problem size solvable on a given platform. In many applications, the coefficient matrices are symmetric, and exploiting symmetry will reduce both the amount of work and storage cost required for factorization. When the factorization is performed on large-scale distributed memory platforms, communication cost is critical to the performance of the algorithm. At the same time, network topologies have become increasingly complex, so that modern platforms exhibit a high level of performance variability. This makes scheduling of computations an intricate and performance-critical task. In this paper, we investigate the use of an asynchronous task paradigm, one-sided communication and dynamic scheduling in implementing sparse Cholesky factorization (symPACK) on large-scale distributed memory platforms. Our solver symPACK relies on efficient and flexible communication primitives provided by the UPC++ library. Performance evaluation shows good scalability and that symPACK outperforms state-of-the-art parallel distributed memory factorization packages, validating our approach on practical cases.

preprint2014arXiv

Constructing Performance Models for Dense Linear Algebra Algorithms on Cray XE Systems

Hiding or minimizing the communication cost is key in order to obtain good performance on large-scale systems. While communication overlapping attempts to hide communications cost, 2.5D communication avoiding algorithms improve performance scalability by reducing the volume of data transfers at the cost of extra memory usage. Both approaches can be used together or separately and the best choice depends on the machine, the algorithm and the problem size. Thus, the development of performance models is crucial to determine the best option for each scenario. In this paper, we present a methodology for constructing performance models for parallel numerical routines on Cray XE systems. Our models use portable benchmarks that measure computational cost and network characteristics, as well as performance degradation caused by simultaneous accesses to the network. We validate our methodology by constructing the performance models for the 2D and 2.5D approaches, with and without overlapping, of two matrix multiplication algorithms (Cannon's and SUMMA), triangular solve (TRSM) and Cholesky. We compare the estimations provided by these models with the experimental results using up to 24,576 cores of a Cray XE6 system and predict the performance of the algorithms on larger systems. Results prove that the estimations significantly improve when taking into account network contention.