Source author record

Yanli Zhao

Yanli Zhao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; physics.optics

Catalog footprint

What is connected

3 works

4 topics

4 close collaborators

preprint · 2026 · arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While FP8 has been applied successfully to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

preprint · 2020 · arXiv

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.

preprint · 2016 · arXiv

One laser pulse generates two photoacoustic signals

Photoacoustic sensing and imaging techniques have been studied widely to explore optical absorption contrast based on nanosecond laser illumination. In this paper, we report a long laser pulse induced dual photoacoustic (LDPA) nonlinear effect, which originates from unsatisfied stress and thermal confinements. Unlike conventional short laser pulse illumination, the proposed method uses a long square-profile laser pulse to induce dual photoacoustic signals. Without satisfying the stress confinement, the dual photoacoustic signals are generated following the positive and negative edges of the long laser pulse. More interestingly, the first, expansion-induced photoacoustic signal exhibits a positive waveform due to the initial sharp rise in temperature. In contrast, the second, contraction-induced photoacoustic signal exhibits an exactly negative waveform due to the fall in temperature, as well as a pulse-width-dependent signal amplitude, which is caused by the concurrent heat accumulation and thermal diffusion during the long laser illumination. An analytical model is derived to describe the generation of the dual photoacoustic pulses, incorporating Grüneisen saturation and the thermal diffusion effect, and is verified experimentally. Lastly, an alternative form of the LDPA technique using quasi-CW laser excitation is also introduced and demonstrated for super-contrast imaging both in vitro and in vivo. Compared with existing nonlinear PA techniques, the proposed LDPA nonlinear effect could enable a much broader range of potential applications.