Source author record

Weile Jia

Weile Jia appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

physics.comp-ph | physics.chem-ph | cond-mat.mtrl-sci | Distributed, Parallel, and Cluster Computing | Machine Learning | Mathematical Software

Catalog footprint

What is connected

7 works

6 topics

4 close collaborators


preprint | 2026 | arXiv

Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning

SO(3) equivariant graph neural networks have become the dominant paradigm for atomistic foundation models, achieving high accuracy and data efficiency by building rotational symmetry directly into the architecture. Yet the computational cost of their higher-order tensor operations creates a tough trade-off between model accuracy and inference efficiency. In this paper, we propose a structural pruning method for SO(3) equivariant atomistic foundation models to bridge this accuracy-efficiency gap. The pruning is applied along the channel and order dimensions, with each irreducible representation kept or removed as a complete block, thereby retaining SO(3) equivariance. Starting from a large checkpoint, the pruned model substantially reduces the inference cost while retaining higher accuracy than an independently trained small model. The pruned MACE-MP model outperforms the official from-scratch trained small model on 7 of 9 metrics on the Matbench Discovery leaderboard. In terms of efficiency, compressed MACE-MP and MACE-OFF models contain 1.5$\times$ to 4$\times$ fewer parameters and require 2.5$\times$ to 4$\times$ less pre-training compute than training a small model from scratch. For downstream applications, fine-tuning the pruned model reduces energy and force errors by 70.1% and 34.4% compared to training task-specific models from scratch across eight representative downstream datasets. We demonstrate that the method generalizes to other SO(3) equivariant architectures (SevenNet, eSCN) and can be combined with quantization and knowledge distillation for further gains.
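The block-wise pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature layout (a dict mapping order l to a list of channels, each holding 2l+1 components), the norm-based channel score, and the function names are all assumptions made for illustration. Only l = 0 and l = 1 blocks are handled, and l = 1 components are treated as Cartesian (x, y, z) vectors. Because whole channels and whole orders are kept or dropped, pruning commutes with rotation.

```python
import math
import random

def channel_norm(channel):
    # L2 norm over the 2l+1 components; a rotation leaves it unchanged.
    return math.sqrt(sum(x * x for x in channel))

def select_channels(feats, n_keep):
    # Hypothetical importance rule: keep the n_keep largest-norm channels per order.
    keep = {}
    for l, block in feats.items():
        ranked = sorted(range(len(block)),
                        key=lambda c: channel_norm(block[c]), reverse=True)
        keep[l] = sorted(ranked[:n_keep])
    return keep

def prune(feats, keep, l_max):
    # Drop whole orders above l_max and whole channels not listed in `keep`.
    # No irreducible block is ever split along its 2l+1 components.
    return {l: [list(block[c]) for c in keep[l]]
            for l, block in feats.items() if l <= l_max}

def rotate_z(feats, angle):
    # Rotate about the z axis: l = 0 blocks are scalars, l = 1 blocks are 3-vectors.
    c, s = math.cos(angle), math.sin(angle)
    out = {}
    for l, block in feats.items():
        if l == 0:
            out[l] = [list(ch) for ch in block]
        elif l == 1:
            out[l] = [[c * x - s * y, s * x + c * y, z] for x, y, z in block]
        else:
            raise ValueError("this sketch only rotates l = 0 and l = 1 blocks")
    return out

random.seed(0)
feats = {
    0: [[random.gauss(0, 1)] for _ in range(8)],
    1: [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)],
}
keep = select_channels(feats, n_keep=4)  # the mask is chosen once, then reused

pruned_then_rotated = rotate_z(prune(feats, keep, l_max=1), 0.7)
rotated_then_pruned = prune(rotate_z(feats, 0.7), keep, l_max=1)
```

Both orders of operation give the same features, which is the equivariance property the abstract says is retained. Calling prune with l_max=0 removes the l = 1 block entirely, illustrating pruning along the order dimension.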

preprint | 2022 | arXiv

DP Compress: a Model Compression Scheme for Generating Efficient Deep Potential Models

Machine-learning-based interatomic potential energy surface (PES) models are revolutionizing the field of molecular modeling. However, although much faster than electronic structure schemes, these models suffer from costly computations via deep neural networks to predict the energy and atomic forces, resulting in lower running efficiency than typical empirical force fields. Herein, we report a model compression scheme for boosting the performance of the Deep Potential (DP) model, a deep-learning-based PES model. This scheme, which we call DP Compress, is an efficient post-processing step after the training of DP models (DP Train). DP Compress combines several DP-specific compression techniques, which typically speed up DP-based molecular dynamics simulations by an order of magnitude and consume an order of magnitude less memory. We demonstrate that DP Compress is sufficiently accurate by testing a variety of physical properties of Cu, H2O, and Al-Cu-Mg systems. DP Compress applies to both CPU and GPU machines and is publicly available online.

preprint | 2022 | arXiv

Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms

High-performance computing, together with a neural network model trained from data generated with first-principles methods, has greatly boosted applications of ab initio molecular dynamics in terms of spatial and temporal scales on modern supercomputers. The previous state of the art achieves 1 to 2 nanoseconds of molecular dynamics simulation per day for 100 million atoms on the entire Summit supercomputer. In this paper, we significantly reduce the memory footprint and computational time through a comprehensive approach with both algorithmic and system innovations. The neural network model is compressed by model tabulation, kernel fusion, and redundancy removal. Optimizations such as acceleration of customized kernels, tabulation of the activation function, and MPI+OpenMP parallelization are then implemented on GPU and ARM architectures. Test results for the copper system show that the optimized code can scale up to the entire machine on both Fugaku and Summit, and the corresponding system size can be extended by a factor of 134 to an unprecedented 17 billion atoms. The strong scaling of a 13.5-million-atom copper system shows that the time-to-solution can be 7 times faster, reaching 11.2 nanoseconds per day. This work opens the door for unprecedentedly large-scale molecular dynamics simulations with ab initio accuracy and can potentially be utilized in studying more realistic applications such as mechanical properties of metals, semiconductor devices, batteries, etc. The optimization techniques detailed in this paper also provide insight for relevant high-performance computing applications.
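The tabulation mentioned in this abstract trades repeated function evaluation for a table lookup. The sketch below shows the general idea on tanh with a uniform grid and linear interpolation. The grid range, the table size, the interpolation order, and the names build_table and lookup are assumptions for illustration; the abstract does not specify the scheme used in the paper.

```python
import math

def build_table(fn, lo, hi, n):
    # Sample fn at n + 1 equally spaced points on [lo, hi].
    step = (hi - lo) / n
    values = [fn(lo + i * step) for i in range(n + 1)]
    return lo, step, values

def lookup(table, x):
    # Linear interpolation between the two nearest samples.
    # Inputs outside [lo, hi] are clamped to the end values, which is
    # reasonable for a saturating function such as tanh.
    lo, step, values = table
    n = len(values) - 1
    t = (x - lo) / step
    i = min(max(int(math.floor(t)), 0), n - 1)
    frac = min(max(t - i, 0.0), 1.0)
    return values[i] + frac * (values[i + 1] - values[i])

table = build_table(math.tanh, -8.0, 8.0, 1024)

# Worst-case deviation from math.tanh over a sweep that also leaves the grid.
xs = [-10.0 + 0.01 * i for i in range(2001)]
max_err = max(abs(lookup(table, x) - math.tanh(x)) for x in xs)
```

With a grid spacing h, linear interpolation has an error of at most h^2/8 times the largest second derivative on the interval, so accuracy can be tuned through the table size at the cost of memory.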

preprint | 2020 | arXiv

ELSI -- An Open Infrastructure for Electronic Structure Solvers

Routine applications of electronic structure theory to molecules and periodic systems need to compute the electron density from given Hamiltonian and, in case of non-orthogonal basis sets, overlap matrices. System sizes can range from few to thousands or, in some examples, millions of atoms. Different discretization schemes (basis sets) and different system geometries (finite non-periodic vs. infinite periodic boundary conditions) yield matrices with different structures. The ELectronic Structure Infrastructure (ELSI) project provides an open-source software interface to facilitate the implementation and optimal use of high-performance solver libraries covering cubic scaling eigensolvers, linear scaling density-matrix-based algorithms, and other reduced scaling methods in between. In this paper, we present recent improvements and developments inside ELSI, mainly covering (1) new solvers connected to the interface, (2) matrix layout and communication adapted for parallel calculations of periodic and/or spin-polarized systems, (3) routines for density matrix extrapolation in geometry optimization and molecular dynamics calculations, and (4) general utilities such as parallel matrix I/O and JSON output. The ELSI interface has been integrated into four electronic structure code projects (DFTB+, DGDFT, FHI-aims, SIESTA), allowing us to rigorously benchmark the performance of the solvers on an equal footing. Based on results of a systematic set of large-scale benchmarks performed with Kohn-Sham density-functional theory and density-functional tight-binding theory, we identify factors that strongly affect the efficiency of the solvers, and propose a decision layer that assists with the solver selection process. Finally, we describe a reverse communication interface encoding matrix-free iterative solver strategies that are amenable, e.g., for use with planewave basis sets.

preprint | 2020 | arXiv

Extreme-Scale Density Functional Theory High Performance Computing of DGDFT for Tens of Thousands of Atoms using Millions of Cores on Sunway TaihuLight

High performance computing (HPC) is a powerful tool to accelerate Kohn-Sham density functional theory (KS-DFT) calculations on modern heterogeneous supercomputers. Here, we describe a massively parallel, extreme-scale, and portable implementation of the discontinuous Galerkin density functional theory (DGDFT) method on the Sunway TaihuLight supercomputer. The DGDFT method uses adaptive local basis (ALB) functions generated on-the-fly during the self-consistent field (SCF) iteration to solve the KS equations with high precision comparable to that of a plane-wave basis set. In particular, the DGDFT method adopts a two-level parallelization strategy that makes use of different types of data distribution, task scheduling, and data communication schemes, and combines them with the master-slave multi-thread heterogeneous parallelism of the SW26010 processor, resulting in extreme-scale HPC KS-DFT calculations on the Sunway TaihuLight supercomputer. We show that the DGDFT method can scale up to 8,519,680 processing cores (131,072 core groups) on the Sunway TaihuLight supercomputer for investigating the electronic structures of two-dimensional (2D) metallic graphene systems containing tens of thousands of carbon atoms.

preprint | 2020 | arXiv

Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning

For 35 years, ab initio molecular dynamics (AIMD) has been the method of choice for modeling complex atomistic phenomena from first principles. However, most AIMD applications are limited by computational cost to systems with thousands of atoms at most. We report that a machine learning-based simulation protocol (Deep Potential Molecular Dynamics), while retaining ab initio accuracy, can simulate a trajectory of more than 1 nanosecond per day for over 100 million atoms, using a highly optimized code (GPU DeePMD-kit) on the Summit supercomputer. Our code can efficiently scale up to the entire Summit supercomputer, attaining 91 PFLOPS in double precision (45.5% of the peak) and 162/275 PFLOPS in mixed-single/half precision. The great accomplishment of this work is that it opens the door to simulating unprecedented size and time scales with ab initio accuracy. It also poses new challenges to the next-generation supercomputer for a better integration of machine learning and physical modeling.

preprint | 2016 | arXiv

A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems

Given a sparse matrix $A$, the selected inversion algorithm is an efficient method for computing certain selected elements of $A^{-1}$. These selected elements correspond to all or some nonzero elements of the LU factors of $A$. In many ways, the type of matrix updates performed in the selected inversion algorithm is similar to that performed in the LU factorization, although the sequence of operation is different. In the context of LU factorization, it is known that the left-looking and right-looking algorithms exhibit different memory access and data communication patterns, and hence different behavior on shared memory and distributed memory parallel machines. Corresponding to right-looking and left-looking LU factorization, selected inversion algorithm can be organized as a left-looking and a right-looking algorithm. The parallel right-looking version of the algorithm has been developed in [1]. The sequence of operations performed in this version of the selected inversion algorithm is similar to those performed in a left-looking LU factorization algorithm. In this paper, we describe the left-looking variant of the selected inversion algorithm, and based on task parallel method, present an efficient implementation of the algorithm for shared memory machines. We demonstrate that with the task scheduling features provided by OpenMP 4.0, the left-looking selected inversion algorithm can scale well both on the Intel Haswell multicore architecture and on the Intel Knights Corner (KNC) manycore architecture. Compared to the right-looking selected inversion algorithm, the left-looking formulation facilitates pipelining of work along different branches of the elimination tree, and can be a promising candidate for future development of massively parallel selected inversion algorithms on heterogeneous architecture.
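To make the notion of selected inversion concrete, here is a minimal sequential sketch for the simplest sparse case, a symmetric tridiagonal matrix. It is an assumption-laden illustration, not the left-looking task-parallel algorithm of the paper: it uses an LDL^T factorization instead of general LU, runs on one thread, and the function name is hypothetical. It computes only the entries of the inverse that sit on the nonzero pattern of the factors (the diagonal and first off-diagonal) and never forms the full inverse.

```python
def selected_inverse_tridiagonal(a, b):
    """Selected entries of the inverse of a symmetric tridiagonal matrix.

    a: main diagonal (length n); b: sub-diagonal (length n - 1).
    Returns (diag, offdiag) with diag[i] = Ainv[i][i] and
    offdiag[i] = Ainv[i + 1][i]. No pivoting is performed, so the
    matrix is assumed to have nonzero pivots (for example, it is
    symmetric positive definite).
    """
    n = len(a)
    # Factorization A = L D L^T, with L unit lower bidiagonal.
    d = [0.0] * n
    l = [0.0] * (n - 1)
    d[0] = a[0]
    for i in range(n - 1):
        l[i] = b[i] / d[i]
        d[i + 1] = a[i + 1] - l[i] * b[i]

    # Backward sweep: each selected entry depends only on selected
    # entries already computed further down the matrix.
    diag = [0.0] * n
    offdiag = [0.0] * (n - 1)
    diag[n - 1] = 1.0 / d[n - 1]
    for i in range(n - 2, -1, -1):
        offdiag[i] = -l[i] * diag[i + 1]
        diag[i] = 1.0 / d[i] - l[i] * offdiag[i]
    return diag, offdiag

# 3 x 3 matrix with 2 on the diagonal and -1 next to it.
# Its inverse is (1/4) * [[3, 2, 1], [2, 4, 2], [1, 2, 3]].
diag, offdiag = selected_inverse_tridiagonal([2.0, 2.0, 2.0], [-1.0, -1.0])
```

The factorization runs from the top of the matrix down and the inversion sweep runs from the bottom up, which reflects the remark in the abstract that the matrix updates resemble those of the factorization while the sequence of operations differs.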
