Source author record

Lianmin Zheng

Lianmin Zheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

physics.acc-ph; Distributed, Parallel, and Cluster Computing; Machine Learning; Applications; Mathematical Software; Programming Languages

Catalog footprint

What is connected

5 works

6 topics

4 close collaborators


preprint · 2022 · arXiv

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa

preprint · 2022 · arXiv

NumS: Scalable Array Programming for the Cloud

Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves performance competitive with SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.
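The placement idea in this abstract can be illustrated with a toy sketch. This is not the NumS implementation: the function names, the load bookkeeping (memory, inbound and outbound network bytes per node), and the choice to sum the three maxima into one cost are all assumptions made for illustration. It simulates placing one operator on each node and keeps the placement with the lowest simulated peak load.

```python
def simulate(loads, target, inputs, out_size):
    """Return the (mem, net_in, net_out) loads if the operator ran on `target`.

    loads    -- tuple of three lists, one entry per node (hypothetical bookkeeping)
    inputs   -- list of (source_node, size) pairs, one per operand block
    out_size -- size of the operator's output block
    """
    mem, net_in, net_out = (list(x) for x in loads)  # copy, do not mutate
    for src, size in inputs:
        if src != target:
            net_out[src] += size      # operand leaves its current node
            net_in[target] += size    # and arrives at the target
            mem[target] += size       # the transferred copy occupies memory
    mem[target] += out_size           # the output is stored on the target
    return mem, net_in, net_out


def cost(loads):
    """Assumed objective: sum of the per-resource maxima over all nodes."""
    return sum(max(resource) for resource in loads)


def place(loads, inputs, out_size):
    """Pick the node whose simulated placement gives the lowest cost."""
    candidates = [simulate(loads, t, inputs, out_size)
                  for t in range(len(loads[0]))]
    best = min(range(len(candidates)), key=lambda t: cost(candidates[t]))
    return best, candidates[best]


# Two nodes: node 0 holds a 100-unit block, node 1 holds a 10-unit block.
loads = ([100.0, 10.0], [0.0, 0.0], [0.0, 0.0])
node, new_loads = place(loads, [(0, 100.0), (1, 10.0)], out_size=5.0)
print(node, new_loads)
```

In this example the operator lands on the node that already holds the larger operand, since moving the small block costs less network and memory than moving the large one.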

preprint · 2021 · arXiv

Eliminating uncertainty of thermal emittance measurement in solenoid scans due to rf and solenoid fields overlap

The solenoid scan is one of the most common methods for the in-situ measurement of the thermal emittance of a photocathode in an rf photoinjector. The fringe field of the solenoid overlaps with the gun rf field in quite a number of photoinjectors, which makes accurate knowledge of the transfer matrix challenging and thus increases the measurement uncertainty of the thermal emittance. This paper summarizes two methods that have been used to solve the overlap issue and explains their deficiencies. Furthermore, we provide a new method to eliminate the measurement error due to the overlap issue in solenoid scans. The new method is systematically demonstrated using theoretical derivations, beam dynamics simulations, and experimental data based on the photoinjector configurations from three different groups, proving that the measurement error with the new method is very small and can be ignored in most of the photoinjector configurations.
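To show why the transfer matrix matters in a scan of this kind, here is a minimal textbook sketch. It is not the method proposed in the paper. It assumes an idealized thin lens of strength k followed by a field-free drift, with no rf and solenoid overlap and no beam rotation, which is exactly the idealization the abstract says breaks down. All function names and numbers are hypothetical. Under these assumptions the squared beam size on the screen is a parabola in k, and its three coefficients give the beam matrix and hence the emittance.

```python
import math


def screen_size_sq(k, drift, s11, s12, s22):
    """Squared rms size after a thin lens (strength k, 1/m) and a drift (m)."""
    m11, m12 = 1.0 - drift * k, drift   # first row of the drift*lens matrix
    return m11 * m11 * s11 + 2.0 * m11 * m12 * s12 + m12 * m12 * s22


def parabola_through(points):
    """Coefficients (a, b, c) of a*k**2 + b*k + c through three (k, y) points."""
    (k1, y1), (k2, y2), (k3, y3) = points
    d = (k1 - k2) * (k1 - k3) * (k2 - k3)
    a = (y1 * (k2 - k3) - y2 * (k1 - k3) + y3 * (k1 - k2)) / d
    b = (-y1 * (k2**2 - k3**2) + y2 * (k1**2 - k3**2)
         - y3 * (k1**2 - k2**2)) / d
    c = (y1 * k2 * k3 * (k2 - k3) - y2 * k1 * k3 * (k1 - k3)
         + y3 * k1 * k2 * (k1 - k2)) / d
    return a, b, c


def emittance_from_scan(points, drift):
    """Recover the rms emittance from three (k, size**2) scan points."""
    a, b, c = parabola_through(points)
    s11 = a / drift**2
    s12 = -(b + 2.0 * drift * s11) / (2.0 * drift**2)
    s22 = (c - s11 - 2.0 * drift * s12) / drift**2
    return math.sqrt(s11 * s22 - s12 * s12)


# Synthetic scan: generate sizes from a made-up beam matrix, then invert.
truth = (4e-6, -1e-6, 1e-6)             # s11 (m^2), s12 (m rad), s22 (rad^2)
scan = [(k, screen_size_sq(k, 1.0, *truth)) for k in (0.5, 1.0, 1.5)]
print(emittance_from_scan(scan, drift=1.0))
```

Any error in the assumed transfer matrix (here, the values of `m11` and `m12`) propagates directly into the recovered beam matrix, which is the sensitivity the abstract refers to.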

preprint · 2020 · arXiv

Rapid thermal emittance and quantum efficiency mapping of a cesium telluride cathode in an rf photoinjector using multiple laser beamlets

Thermal emittance and quantum efficiency (QE) are key figures of merit of photocathodes, and their uniformity is critical to high-performance photoinjectors. Several QE mapping technologies have been successfully developed; however, there is still a dearth of information on thermal emittance maps. This is because of the extremely time-consuming procedure to gather measurements by scanning a small beam across the cathode with fine steps. To simplify the mapping procedure, and to reduce the time required to take measurements, we propose a new method that requires only a single scan of the solenoid current to simultaneously obtain thermal emittance and QE distribution by using a pattern beam with multiple beamlets. In this paper, its feasibility has been confirmed by both beam dynamics simulation and theoretical analysis. The method has been successfully demonstrated in a proof-of-principle experiment using an L-band radiofrequency photoinjector with a cesium telluride cathode. In the experiment, seven beamlets were generated from a microlens array system and their corresponding thermal emittance and QE varied from 0.93 to 1.14 μm/mm and from 4.6 to 8.7%, respectively. We also discuss the limitations and future improvements of the method in this paper.

preprint · 2019 · arXiv

Development and high-power testing of an X-band dielectric-loaded power extractor

Dielectric loaded structures are promising candidates for use in the structure wakefield acceleration (SWFA) technique, for both the collinear wakefield acceleration (CWA) and the two-beam acceleration (TBA) approaches, due to their low fabrication cost, low rf losses, and the potential to withstand high gradients. A short-pulse (≤20 ns) TBA program is under development at the Argonne Wakefield Accelerator (AWA) facility, where dielectric loaded structures are being used for both the power extractor/transfer structure (PETS) and the accelerator. In this study, an X-band 11.7 GHz dielectric PETS was developed and tested at the AWA facility to demonstrate high-power wakefield generation. The PETS was driven by a train of eight electron bunches separated by 769.2 ps (9 times the X-band rf period) in order to achieve coherent wakefield superposition. A total train charge of 360 nC was passed through the PETS structure to generate ~200 MW, ~3 ns flat-top rf pulses without rf breakdown. A future experiment is being planned to increase the generated rf power to approximately 1 GW by optimizing the structure design and improving the drive beam quality.
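The bunch spacing quoted in this abstract follows from the rf frequency, as the short check below shows. The function name is hypothetical. The per-bunch charge is only the arithmetic mean of the stated train charge, since the abstract does not say how the charge was distributed among the bunches.

```python
def bunch_spacing_ps(freq_ghz, rf_periods):
    """Spacing in picoseconds for bunches placed every `rf_periods` rf cycles."""
    period_ps = 1e3 / freq_ghz        # a 1 GHz wave has a 1000 ps period
    return rf_periods * period_ps


spacing = bunch_spacing_ps(11.7, 9)   # X-band frequency and multiple from the abstract
print(f"{spacing:.1f} ps")            # prints "769.2 ps"

mean_charge_nc = 360 / 8              # average over eight bunches, in nC
print(f"{mean_charge_nc:.0f} nC per bunch on average")
```

Spacing the bunches by a whole number of rf periods keeps each bunch's wakefield in phase with the previous ones, which is the coherent superposition the abstract mentions.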