Source author record

Victor Wen-zhe Yu

Victor Wen-zhe Yu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

cond-mat.mtrl-sci physics.comp-ph

Catalog footprint

What is connected

6 works

2 topics

4 close collaborators

Preprint, 2022, arXiv

GPU Acceleration of Large-Scale Full-Frequency GW Calculations

Many-body perturbation theory is a powerful method to simulate electronic excitations in molecules and materials starting from the output of density functional theory calculations. By implementing the theory efficiently so as to run at scale on the latest leadership high-performance computing systems it is possible to extend the scope of GW calculations. We present a GPU acceleration study of the full-frequency GW method as implemented in the WEST code. Excellent performance is achieved through the use of (i) optimized GPU libraries, e.g., cuFFT and cuBLAS, (ii) a hierarchical parallelization strategy that minimizes CPU-CPU, CPU-GPU, and GPU-GPU data transfer operations, (iii) nonblocking MPI communications that overlap with GPU computations, and (iv) mixed-precision in selected portions of the code. A series of performance benchmarks have been carried out on leadership high-performance computing systems, showing a substantial speedup of the GPU-accelerated version of WEST with respect to its CPU version. Good strong and weak scaling is demonstrated using up to 25920 GPUs. Finally, we showcase the capability of the GPU version of WEST for large-scale, full-frequency GW calculations of realistic systems, e.g., a nanostructure, an interface, and a defect, comprising up to 10368 valence electrons.

Preprint, 2021, arXiv

GPU-Acceleration of the ELPA2 Distributed Eigensolver for Dense Symmetric and Hermitian Eigenproblems

The solution of eigenproblems is often a key computational bottleneck that limits the tractable system size of numerical algorithms, among them electronic structure theory in chemistry and in condensed matter physics. Large eigenproblems can easily exceed the capacity of a single compute node, thus must be solved on distributed-memory parallel computers. We here present GPU-oriented optimizations of the ELPA two-stage tridiagonalization eigensolver (ELPA2). On top of cuBLAS-based GPU offloading, we add a CUDA kernel to speed up the back-transformation of eigenvectors, which can be the computationally most expensive part of the two-stage tridiagonalization algorithm. We benchmark the performance of this GPU-accelerated eigensolver on two hybrid CPU-GPU architectures, namely a compute cluster based on Intel Xeon Gold CPUs and NVIDIA Volta GPUs, and the Summit supercomputer based on IBM POWER9 CPUs and NVIDIA Volta GPUs. Consistent with previous benchmarks on CPU-only architectures, the GPU-accelerated two-stage solver exhibits a parallel performance superior to the one-stage counterpart. Finally, we demonstrate the performance of the GPU-accelerated eigensolver developed in this work for routine semi-local KS-DFT calculations comprising thousands of atoms.

Preprint, 2020, arXiv

ELSI -- An Open Infrastructure for Electronic Structure Solvers

Routine applications of electronic structure theory to molecules and periodic systems need to compute the electron density from given Hamiltonian and, in case of non-orthogonal basis sets, overlap matrices. System sizes can range from few to thousands or, in some examples, millions of atoms. Different discretization schemes (basis sets) and different system geometries (finite non-periodic vs. infinite periodic boundary conditions) yield matrices with different structures. The ELectronic Structure Infrastructure (ELSI) project provides an open-source software interface to facilitate the implementation and optimal use of high-performance solver libraries covering cubic scaling eigensolvers, linear scaling density-matrix-based algorithms, and other reduced scaling methods in between. In this paper, we present recent improvements and developments inside ELSI, mainly covering (1) new solvers connected to the interface, (2) matrix layout and communication adapted for parallel calculations of periodic and/or spin-polarized systems, (3) routines for density matrix extrapolation in geometry optimization and molecular dynamics calculations, and (4) general utilities such as parallel matrix I/O and JSON output. The ELSI interface has been integrated into four electronic structure code projects (DFTB+, DGDFT, FHI-aims, SIESTA), allowing us to rigorously benchmark the performance of the solvers on an equal footing. Based on results of a systematic set of large-scale benchmarks performed with Kohn-Sham density-functional theory and density-functional tight-binding theory, we identify factors that strongly affect the efficiency of the solvers, and propose a decision layer that assists with the solver selection process. Finally, we describe a reverse communication interface encoding matrix-free iterative solver strategies that are amenable, e.g., for use with planewave basis sets.

Preprint, 2020, arXiv

SIESTA: recent developments and applications

A review of the present status, recent enhancements, and applicability of the SIESTA program is presented. Since its debut in the mid-nineties, SIESTA's flexibility, efficiency and free distribution has given advanced materials simulation capabilities to many groups worldwide. The core methodological scheme of SIESTA combines finite-support pseudo-atomic orbitals as basis sets, norm-conserving pseudopotentials, and a real-space grid for the representation of charge density and potentials and the computation of their associated matrix elements. Here we describe the more recent implementations on top of that core scheme, which include: full spin-orbit interaction, non-repeated and multiple-contact ballistic electron transport, DFT+U and hybrid functionals, time-dependent DFT, novel reduced-scaling solvers, density-functional perturbation theory, efficient Van der Waals non-local density functionals, and enhanced molecular-dynamics options. In addition, a substantial effort has been made in enhancing interoperability and interfacing with other codes and utilities, such as Wannier90 and the second-principles modelling it can be used for, an AiiDA plugin for workflow automatization, interface to Lua for steering SIESTA runs, and various postprocessing utilities. SIESTA has also been engaged in the Electronic Structure Library effort from its inception, which has allowed the sharing of various low level libraries, as well as data standards and support for them, in particular the PSML definition and library for transferable pseudopotentials, and the interface to the ELSI library of solvers. Code sharing is made easier by the new open-source licensing model of the program. This review also presents examples of application of the capabilities of the code, as well as a view of on-going and future developments.

Preprint, 2020, arXiv

The CECAM Electronic Structure Library and the modular software development paradigm

First-principles electronic structure calculations are very widely used thanks to the many successful software packages available. Their traditional coding paradigm is monolithic, i.e., regardless of how modular its internal structure may be, the code is built independently from others, from the compiler up, with the exception of linear-algebra and message-passing libraries. This model has been quite successful for decades. The rapid progress in methodology, however, has resulted in an ever increasing complexity of those programs, which implies a growing amount of replication in coding and in the recurrent re-engineering needed to adapt to evolving hardware architecture. The Electronic Structure Library (ESL) was initiated by CECAM (European Centre for Atomic and Molecular Calculations) to catalyze a paradigm shift away from the monolithic model and promote modularization, with the ambition to extract common tasks from electronic structure programs and redesign them as free, open-source libraries. They include "heavy-duty" ones with a high degree of parallelisation, and potential for adaptation to novel hardware within them, thereby separating the sophisticated computer science aspects of performance optimization and re-engineering from the computational science done by scientists when implementing new ideas. It is a community effort, undertaken by developers of various successful codes, now facing the challenges arising in the new model. This modular paradigm will improve overall coding efficiency and enable specialists (computer scientists or computational scientists) to use their skills more effectively. It will lead to a more sustainable and dynamic evolution of software as well as lower barriers to entry for new developers.

Preprint, 2019, arXiv

GPGPU Acceleration of All-Electron Electronic Structure Theory Using Localized Numeric Atom-Centered Basis Functions

We present an implementation of all-electron density-functional theory for massively parallel GPGPU-based platforms, using localized atom-centered basis functions and real-space integration grids. Special attention is paid to domain decomposition of the problem on non-uniform grids, which enables compute- and memory-parallel execution across thousands of nodes for real-space operations, e.g. the update of the electron density, the integration of the real-space Hamiltonian matrix, and calculation of Pulay forces. To assess the performance of our GPGPU implementation, we performed benchmarks on three different architectures using a 103-material test set. We find that operations which rely on dense serial linear algebra show dramatic speedups from GPGPU acceleration: in particular, SCF iterations including force and stress calculations exhibit speedups ranging from 4.5 to 6.6. For the architectures and problem types investigated here, this translates to an expected overall speedup between 3-4 for the entire calculation (including non-GPU accelerated parts), for problems featuring several tens to hundreds of atoms. Additional calculations for a 375-atom Bi$_2$Se$_3$ bilayer show that the present GPGPU strategy scales for large-scale distributed-parallel simulations.
