Source author record

Guangwen Yang

Guangwen Yang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Distributed, Parallel, and Cluster Computing quant-ph Artificial Intelligence math.OC physics.plasm-ph Programming Languages

Catalog footprint

What is connected

7works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Towards Exascale Computation for Turbomachinery Flows

A state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion Degree of Freedoms (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed towards solving the grand challenge problem outlined by NASA, which involves a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat transfer components.

preprint2022arXiv

swTVM: Towards Optimized Tensor Code Generation for Deep Learning on Sunway Many-Core Processor

The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient codes for deep learning workloads on Sunway. The experiment results show that the codes generated by swTVM achieves 1.79x on average compared to the state-of-the-art deep learning framework on Sunway, across six representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and Sunway processor particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and Sunway many-core processor.

preprint2021arXiv

Model-based Adversarial Meta-Reinforcement Learning

Meta-reinforcement learning (meta-RL) aims to learn from multiple training tasks the ability to adapt efficiently to unseen test tasks. Despite the success, existing meta-RL algorithms are known to be sensitive to the task distribution shift. When the test task distribution is different from the training task distribution, the performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), where we aim to minimize the worst-case sub-optimality gap -- the difference between the optimal return and the return that the algorithm achieves after adaptation -- across all tasks in a family of tasks, with a model-based approach. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model -- the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be efficiently implemented by the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its efficacy in the worst-case performance over all tasks, the generalization power to out-of-distribution tasks, and in training and test time sample efficiency, over existing state-of-the-art meta-RL algorithms.

preprint2020arXiv

An Adaptive Remote Stochastic Gradient Method for Training Neural Networks

We present the remote stochastic gradient (RSG) method, which computes the gradients at configurable remote observation points, in order to improve the convergence rate and suppress gradient noise at the same time for different curvatures. RSG is further combined with adaptive methods to construct ARSG for acceleration. The method is efficient in computation and memory, and is straightforward to implement. We analyze the convergence properties by modeling the training process as a dynamic system, which provides a guideline to select the configurable observation factor without grid search. ARSG yields $O(1/\sqrt{T})$ convergence rate in non-convex settings, that can be further improved to $O(\log(T)/T)$ in strongly convex settings. Numerical experiments demonstrate that ARSG achieves both faster convergence and better generalization, compared with popular adaptive methods, such as ADAM, NADAM, AMSGRAD, and RANGER for the tested problems. In particular, for training ResNet-50 on ImageNet, ARSG outperforms ADAM in convergence speed and meanwhile it surpasses SGD in generalization.

preprint2020arXiv

Benchmarking 50-Photon Gaussian Boson Sampling on the Sunway TaihuLight

Boson sampling is expected to be one of an important milestones that will demonstrate quantum supremacy. The present work establishes the benchmarking of Gaussian boson sampling (GBS) with threshold detection based on the Sunway TaihuLight supercomputer. To achieve the best performance and provide a competitive scenario for future quantum computing studies, the selected simulation algorithm is fully optimized based on a set of innovative approaches, including a parallel scheme and instruction-level optimizing method. Furthermore, data precision and instruction scheduling are handled in a sophisticated manner by an adaptive precision optimization scheme and a DAG-based heuristic search algorithm, respectively. Based on these methods, a highly efficient and parallel quantum sampling algorithm is designed. The largest run enables us to obtain one Torontonian function of a 100 x 100 submatrix from 50-photon GBS within 20 hours in 128-bit precision and 2 days in 256-bit precision.

preprint2019arXiv

Quantum Teleportation-Inspired Algorithm for Sampling Large Random Quantum Circuits

We show that low-depth random quantum circuits can be efficiently simulated by a quantum teleportation-inspired algorithm. By using logical qubits to redirect and teleport the quantum information in quantum circuits, the original circuits can be renormalized to new circuits with a smaller number of logical qubits. We demonstrate the algorithm to simulate several random quantum circuits, including 1D-chain 1000-qubit 42-depth, 2D-grid 125*8-qubit 42-depth and 2D-Bristlecone 72-qubit 32-depth circuits. Our results present a memory-efficient method with a clear physical picture to simulate low-depth random quantum circuits.

preprint2016arXiv

Largest Particle Simulations Downgrade the Runaway Electron Risk for ITER

Fusion energy will be the ultimate clean energy source for mankind. One of the most visible concerns of the future fusion device is the threat of deleterious runaway electrons (REs) produced during unexpected disruptions of the fusion plasma. Both efficient long-term algorithms and super-large scale computing power are necessary to reveal the complex dynamics of REs in a realistic fusion reactor. In the present study, we deploy the world's fastest supercomputer, Sunway TaihuLight, and the newly developed relativistic volume-preserving algorithm to carry out long-term particle simulations of 10^7 sampled REs in 6D phase space, which involves simulation scale of 10^18 particle-steps, the largest ever achieved in fusion research. Our simulations show that in a realistic fusion reactor, the concern of REs is not as serious as previously thought. Specifically, REs are confined much better than previously predicted and the maximum average energy is in the range of 150MeV, less than half of previous estimate.